PREFACE TO THE THIRD EDITION

This book has been written primarily for those who have little or no
previous training in mathematical statistics, but who have some training
or experience in the presentation and handling of statistical data. It is 
consequently not written in the form of a mathematical treatise, and 
mathematical proofs have not been included. On the other hand an attempt 
has been made to cover all the modern developments of sampling theory 
which are of importance in census and survey work, and to give an adequate
discussion of the complexities that are encountered in their practical application. 
This has necessitated fuller treatment of the subject than is to be found in 
textbooks on mathematical statistics, or than is normally included in statistical 
courses. Indeed, the orderly development imposed by the preparation of a 
book revealed a number of gaps in current theory which had to be filled in. 
Consequently the book should also prove of value to mathematical statisticians 
who are interested in sampling theory and its applications. 

The work had its origin in a request of the United Nations Sub-Commission
on Statistical Sampling, at their first session held at Lake Success in September, 
1947, that a manual be prepared to assist in the execution of the projected 
1950 World Census of Agriculture, and the 1950 World Census of Population. 
The various computational procedures have been illustrated, as far as practicable, 
by numerical examples. These examples in the main have an agricultural 
background, since this type of data was most readily accessible and is also 
particularly relevant to the original purpose of the book. For the most part 
the data on which they are based form a small part of the results of much 
larger surveys. The examples do not in themselves serve as models for the 
reduction of large bodies of data, but once the general principles have been 
grasped no great difficulty should be found in planning this reduction, which 
presents very similar problems to those encountered in the analysis of material 
from complete censuses and surveys. 

Two new chapters (9 and 10) were included in the second edition. These 
amplify certain aspects which were not fully dealt with in the first edition and 
contain accounts of various later developments, as well as of problems arising 
in the analysis of investigational surveys. Chapter 9 is of a fairly advanced 
nature, but most of Chapter 10 will, I think, be found fairly easy reading. 

In the third edition a further chapter on the use of electronic computers in 
the analysis of censuses and surveys has been added. A computer was installed 
in the Rothamsted Statistics Department in 1954, and our experience over the 
last six years has abundantly demonstrated the value of these machines in
statistical work of all kinds. 




I have not attempted to ascribe priority in the discovery of particular 
methods. Indeed, such a task presents almost insuperable difficulties, since 
the methods used in many surveys are not at all fully reported, and the main 
developments have arisen chiefly through ingenious practical workers devising 
new methods of selection which seemed on commonsense grounds to be capable 
of giving specially accurate results, or appeared to possess other valuable 
properties. 

The first edition has been translated into French and Japanese. The 
reception accorded to the book, and the influence it appears to have had amongst 
practical workers, have more than gratified my hope that it would serve a 
useful purpose in encouraging a wider adoption of sound techniques. The 
United Nations Sub-Commission on Statistical Sampling, to whose deliberations
the book owes so much, was wound up after its meeting in Calcutta at the
beginning of 1952; in all it held five sessions. The rapid adoption of sampling
techniques in many parts of the world where they had hitherto been unknown, 
and the very marked improvement in survey practice, will serve as a lasting 
monument to its labours. 

For those who require a short introduction to the subject, particularly in 
its statistical aspects, I have appended a list of sections for first reading (p. xvi). 
If these are thoroughly mastered the reader should have a reasonable grasp 
of the main types of sampling and the basic statistical theory. He can then 
amplify and extend this knowledge as need arises.

I have been much helped by my wife in the original planning of the book 
and I have had considerable assistance in the preparation of the various editions 
from members of the Rothamsted Statistical Department, particularly from 
P. R. D. Avis, Rose O. Cashen, the late P. M. Grundy, Ruth T. Hunt, G. M. 
Jolly, F. B. Leech, H. D. Patterson and H. R. Simpson. 



F. YATES 



ROTHAMSTED EXPERIMENTAL STATION, 
14th March, 1960
SAMPLING METHODS FOR

CENSUSES AND SURVEYS

CHAPTER 1 

THE PLACE OF SAMPLING IN CENSUS WORK 

1.1 The sampling process 

Sampling, that is, the selection of part of an aggregate of material to 
represent the whole aggregate, is a long-established practice. Simple examples 
are provided by a handful of grain taken from a sack, or a piece of cloth cut 
off a roll. In these cases little attention need be paid to the selection process, 
since the whole of the material is similar or well-mixed, and any part of it 
if not too small is likely to be closely representative of the whole. When, 
however, the aggregate to be sampled consists of units which are somewhat 
dissimilar amongst themselves, and which are not well-mixed, a small sample 
of these units may not be representative of the whole aggregate. Even if units 
are selected from different parts of the aggregate, and other suitable precautions 
are taken, the sample is likely to a certain extent to be unrepresentative owing 
to the chance inclusion of an undue proportion of units of a particular type. 
It will clearly not be representative if units of a particular type are chosen 
deliberately to the exclusion of other types, or if the process of selection is 
such that certain types of unit are favoured at the expense of others. Thus 
in sampling a heap of coal by taking a few shovelfuls from the edges, too great 
a proportion of the large lumps will be obtained, since the large lumps tend 
to roll down the sides and be distributed round the edges of the heap. 
Similarly in the sampling of continuous material, a single portion, even if 
quite large, may not be adequately representative; a piece of cloth cut off
the end of a roll in which the quality of the weaving varies progressively, will 
not form an adequate sample of the whole roll. 

Census and survey work is normally carried out on material made up of 
dissimilar units. Censuses of population, censuses of industrial production, 
and censuses of agriculture have the common feature that the aggregate of 
material embraces a large number of separate units which are often markedly 
dissimilar in various respects. In many cases the purposes for which the 
information is required are adequately served if a proportion only of the units 
are covered, but because of the dissimilarity of the different units neither
haphazard nor casual selection, and still less deliberate selection, can be
expected to provide a representative sample. Rigorous processes of selection 
have therefore to be used. 

Censuses carried out on a properly selected sample will be called sample 
censuses. There has in the past been a tendency to use the term sample to 
refer to the results of an attempted complete census in which there has been 
failure to obtain information from a substantial proportion of the units. Its 
use in this sense is strongly to be deprecated; instead the term incomplete
census is suggested. The term sample should be reserved for a set of units
or portion of an aggregate of material which has been selected in the belief 
that it will be representative of the whole aggregate. 

1.2 Sampling errors 

Whether or not a sample will give results which are sufficiently 
representative of the whole aggregate depends primarily on whether the errors 
introduced by the sampling process are sufficiently small not to invalidate 
the results for the purposes for which they are required. Even if a proper 
process of selection is employed, the sample cannot be exactly representative 
of the whole aggregate. The inevitable errors which then occur in the 
results are termed the random sampling errors of these results. The average 
magnitude of these random sampling errors will depend on the size of the 
sample, on the variability of the material, on the sampling procedure adopted, 
and on the way in which the results are calculated. 

It is a fortunate fact that if a proper process of selection is adopted, the 
average magnitude of the random sampling errors, and indeed the expected 
frequency of occurrence of errors of any magnitude, can be calculated from 
the detailed results obtained from an actual sample. The methods by which 
this can be done depend on the mathematical theory of statistical sampling. 
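
As a minimal sketch of this point, assuming a simple random sample drawn without replacement from a population of known size, the standard error of the sample mean can be estimated from the sample values alone. The function name and the yield figures below are hypothetical and serve only as illustration.

```python
import math
import statistics

def standard_error_of_mean(sample, population_size):
    """Estimate the standard error of the mean of a simple random sample.

    Assumes sampling without replacement from a finite population, so the
    finite population correction (1 - n/N) is applied.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("at least two sampled units are needed")
    if n > population_size:
        raise ValueError("the sample cannot exceed the population")
    variance = statistics.variance(sample)  # unbiased estimate, divisor n - 1
    correction = 1 - n / population_size    # vanishes for a complete census
    return math.sqrt(correction * variance / n)

# Hypothetical yields from 8 fields sampled out of a population of 400.
yields = [21.0, 18.5, 24.0, 19.5, 22.5, 20.0, 23.5, 17.0]
print(statistics.mean(yields), standard_error_of_mean(yields, 400))
```

The estimate shrinks as the sample grows or as the material becomes less variable, and it is zero when every unit is included.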

An extension of the analysis involved in the calculation of these errors 
enables the relative accuracy of the different sampling methods which can be 
employed on the same material to be assessed, and thus enables further surveys 
to be more efficiently planned. 

It is the development of these processes that has changed sampling from 
a speculative and uncertain procedure to a method having definite and 
determinable precision. Sampling has thus become a reliable method in which 
full confidence can be placed. In addition, the possibility of setting ascertainable 
limits to the random sampling errors has served to throw into prominence 
those other types of error which arise from faulty selection processes or faulty 
methods of observation, or which exist in some other source of information 
with which the sampling results are being compared. 

1.3 The place of sampling in census and survey work 

Sampling will only be of use in census work if, as mentioned in Section 1.2,
the sampling errors are sufficiently small not to affect the validity of the results
for the purposes for which they are required. This will in part be a function
of the degree to which the results have to be broken down. If only overall 
results for the whole population are required, a given degree of accuracy will 
be attained with a far smaller sample than will be the case if detailed results 
for different parts of the population (e.g. different regions, towns, etc.) are 
required. In certain circumstances the sample may have to be so large that 
there will be little point in using a sample census in place of a complete 
census. Obviously, in the extreme case where information on all the individual 
units is required, this can only be obtained by a complete census. 
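
The effect of breaking down the results can be illustrated by a rough calculation. The sketch below assumes simple random sampling, a known standard deviation per unit, and the same variability within each region as in the population as a whole; the function name and all figures are hypothetical.

```python
import math

def required_sample_size(std_dev, target_se, population_size):
    """Units needed for the sample mean to attain a target standard error.

    Assumes a simple random sample without replacement; the finite
    population correction reduces the size needed in small populations.
    """
    n0 = (std_dev / target_se) ** 2        # size needed in a very large population
    n = n0 / (1 + n0 / population_size)    # adjusted for the finite population
    return math.ceil(n)

# Overall result only: one population of 100,000 units.
overall = required_sample_size(10, 0.5, 100_000)

# The same accuracy for each of 20 regions of 5,000 units.
per_region = required_sample_size(10, 0.5, 5_000)

print(overall, per_region, 20 * per_region)  # 399 371 7420
```

On these assumptions, results of the same accuracy for each of the 20 regions require a sample nearly twenty times as large as that needed for the overall result.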

Another factor which influences the decision whether or not to use sampling 
is the relative difficulty and cost of organizing a sample census and a complete
census. The amount of effort and expense required to collect information 
is always greater per unit for a sample than for a complete census. In addition 
a sample census presents its own organization problems, some of which are 
absent from a complete census, and it occasionally happens, if the information 
required is very simple, that a complete census can be carried out through the 
ordinary administrative channels, whereas a sample census requires the 
setting-up of a separate organization. Usually, however, if the size of the 
sample needed to give the required accuracy represents only a small fraction 
of the whole population, the total effort and expense required to collect the 
information by sampling methods will be very much less than that required 
for a census of the whole population. 

In many cases, therefore, sampling results in great economy of effort. 
It has also other advantages which are not so immediately apparent. In the 
first place, the completeness and accuracy of the returns may be much more 
easily ensured if the information is collected from only a small proportion 
of the population. If, for example, questionnaires are sent through the post, 
it is frequently impossible in a complete census to bring pressure to bear on 
those who fail to make their returns, even where the completion of these 
questionnaires is compulsory, owing to the large numbers of individuals 
involved. In the case of a sample, the smaller number of individuals enables 
follow-up notices to be sent and telephone calls and visits to be made. The 
separate returns can also be much more carefully scrutinized, and further 
enquiries undertaken where there is reason to doubt their accuracy. 

Secondly, it is possible to obtain more detailed information in a sample 
census. Although the burden on the individual of furnishing more detailed 
information is not lessened, except when different items of information can 
be obtained from different individuals, the individuals concerned are more 
likely to be willing to provide such information if they know that they represent 
a small sample of the whole population. Detailed information, when obtained,
can be more easily handled, both at the stage of abstraction and coding of the 
original information and in the analysis of the coded results. Owing to the 
reduced volume of material that has to be handled the quality of the abstraction 
and analysis can also be improved, the former because a higher grade of clerical 
labour can be employed, with better supervision, and the latter because the 
data can be classified in many more ways with the same amount of computing
or machine time. 

Thirdly, in many types of census the use of sampling makes possible 
a very considerable increase in speed, both in the execution of the field work, 
and in the analysis of the results. Speed in analysis can also be obtained, in
the case of a complete census, by taking a sample of the returns for abstraction 
and analysis. This device is frequently of value for providing preliminary 
results quickly, even when a final analysis of the whole of the returns is 
ultimately required. 

The use of sampling is essential for investigations of the sociological type 
in which extensive and detailed information has to be collected from individuals, 
many of whom have neither the education nor experience required to answer 
detailed questionnaires without assistance. It is equally essential in 
investigations requiring skilled physical observations and measurements. 
Such investigations can only be carried out by the use of trained investigators, 
and complete investigations covering any large group of the population or body 
of material are consequently impossible, both on grounds of expense and 
because, even if the expense can be tolerated, a sufficient body of investigators 
can rarely be recruited and trained.

For an investigation of this kind involving the collection of elaborate 
information the term survey is usually employed. It seems a mistake, however, 
to confine the word survey to a sample survey or the word census to a complete 
census. Thus B. Seebohm Rowntree (1901), when he carried out an 
investigation into the social and economic conditions of all working-class 
families in York, correctly described this as a survey. 

Although the use of sampling necessarily introduces certain inaccuracies, 
owing to sampling errors, the results obtained by sampling are frequently 
more accurate than those obtained in a complete census or survey. The
random sampling errors are always assessable. The other errors to which a 
survey is subject, such as incompleteness of returns and inaccuracy of 
information, are liable to be very much more serious in a complete census 
than in a sample census, since far more effective precautions can be taken
to see that the information is accurate and complete in a sample census. 
Furthermore, the use of sampling greatly facilitates the imposition of additional 
more detailed checks. Indeed, a complete census can only be properly tested 
for accuracy by some form of sampling check. 

On the other hand, the claim that is sometimes made that the reliability 
and accuracy of the results of a properly planned sample census can be 
assessed with full objectivity from the results themselves is only partly true. 
The random sampling errors can be so assessed, and under certain circumstances 
it is possible to obtain comparisons between different investigators. If all 
investigators or respondents tend to make the same kind of error, however, 
this will not be revealed in the results, whether the census is complete or carried 
out on a sample. 

In respect of coverage a sample census may in certain circumstances be
less reliable than a complete census. It is, for example, relatively simple for
an investigator to ascertain by direct question whether an individual has already 
been included in a population census, and simple intensive checks of certain 
areas, say villages, can be made in a similar manner to verify that there is no
appreciable number of omissions. Similarly, in a survey of physical objects 
such as houses, a marking system or other suitable device can often be used 
to guard against duplication and omission. Such checks are impossible in 
the case of a sample census. 

This is one of the most difficult points in the practical design of many 
sample surveys, particularly in undeveloped areas. To overcome it complete 
enumeration can sometimes be used in conjunction with sampling. Where a 
complete enumeration of the whole of the population or aggregate of material 
presents no particular difficulty, but where the collection of detailed information 
from all units would be a difficult or impossible undertaking, a complete 
enumeration can be carried out. This is then used as a basis for the selection 
of the sample for such sample censuses and surveys as are required to provide 
the detailed information. 

1.4 Development of the use of sampling in censuses and surveys

Prior to the development of the appropriate methods of estimation of 
sampling errors and a clear recognition of the conditions governing satisfactory 
methods of selection of the sample, the use of sampling in census and survey 
work often proved unsatisfactory. There are many early examples of sample 
censuses and surveys which are defective in one way or another. Even when 
the basic principles of the simpler forms of sampling were understood, the 
attempted use of more complicated forms before methods of evaluating their 
errors and relative efficiency had been worked out gave rise to further defective
surveys.

This has led to a certain mistrust of sampling, which still exists in some 
quarters. During recent years, however, there has been a rapid growth in 
the use of sampling in various countries. This development has been greatly 
stimulated by the war and its attendant measures of large-scale economic 
control. Such measures, if they were to be effective in the changing conditions 
met with in wartime, demanded an efficient and speedy information service
which only the sampling method could supply. This has resulted in further 
improvements in technique through the stimulation of research into the theory 
of sampling methods, and the provision of basic data for practical investigations 
of the relative efficiency of the various methods in different fields. 

It still remains true, however, that in inexperienced hands sampling may 
give unsatisfactory results, owing to the use of faulty methods of selection, 
inappropriate sampling design, or inefficient methods of estimation. The prime 
requirement of any large-scale sample survey is therefore that the organization 
of the survey should be carried out by a person who has adequate knowledge 
and experience of sampling methods and their application. The methods 
employed must be thoroughly sound, theoretically and practically, both in
order that satisfactory results may be ensured, and also in order that mistrust 
cannot subsequently be engendered by criticism of the methods adopted. 
It must never be forgotten that it is not sufficient to provide results which are 
in fact correct. They must also be generally accepted if they are to have their 
full value.

It is sometimes stated that no large-scale sample census or survey should 
be carried out without the advice of an expert mathematical statistician with 
experience of such work. Unquestionably, if the services of such an expert 
can be secured this is all to the good, but my own experience is that no one 
expert can be expected to supervise adequately more than a very few surveys 
at any one time, since adequate supervision demands a very full knowledge 
both of the material that is to be surveyed and of local conditions, coupled 
with close attention to detail at all stages. An expert acting in an advisory 
capacity is therefore no substitute for the statistician on the spot, who must be 
prepared to accept responsibility for the planning, execution and analysis of 
the survey. To do this he must himself have both an adequate knowledge 
of sampling procedure and thorough knowledge of the material and local 
conditions. 

Consequently, if full and effective use is to be made of sampling methods, 
statisticians and others who already have experience of the conduct of complete 
censuses but no training in sampling methods must themselves undertake a 
study of these methods, in order that they may decide in what ways these can 
be applied to their own problems. The function of the expert then becomes 
one of advice on exceptional problems, rather than one of detailed supervision. 

Fortunately the principles underlying good sampling methods are not 
unduly difficult to understand, and provided a proper respect is observed for 
the fundamental rules of procedure I believe they can be successfully applied 
by those who have statistical experience but who are not primarily mathematical 
statisticians. 

1.5 Method of presentation 

The method of presentation adopted in this book is to take the various 
parts of the sampling process in roughly the order they are encountered in 
the execution of a census or survey, and discuss the various aspects of each 
part in turn. Thus Chapters 2 and 3 describe the various types of sample 
that can be used, and the general principles to be followed in the selection 
of a sample, Chapter 4 deals with the practical planning of a survey, and 
Chapter 5 with the problems encountered in its execution and in the abstraction 
of the results. The remaining chapters are concerned with the more strictly 
statistical problems. Chapter 6 deals with the various methods of estimating 
the population values, Chapter 7 with the estimation of sampling errors, and 
Chapter 8 with the determination of the relative efficiency of the various 
sampling methods. This method of presentation has the advantage that the 
more practical aspects of sampling procedure are dealt with first. It is true
that knowledge of the statistical techniques described in the later chapters is 
necessary before the relative merits of different methods of sampling any 
particular type of material can be accurately assessed. The detailed application 
of these techniques, however, is the province of those responsible for the 
numerical analysis of the results, whereas the planning is also the concern 
of those who require the information and those who are concerned with its 
collection. The planning can be undertaken much more efficiently, and with 
added interest, if all concerned understand in general terms the underlying 
problems. It is hoped that study of the first five chapters will give this
understanding. If they also act as a stimulus to the study of Chapter 6 and the
first few sections of Chapter 7 the understanding should be correspondingly 
deepened. 

For those responsible for the numerical analysis of the results, and for 
the assessment of the relative efficiency of the different possible methods, 
thorough study of the whole book is necessary. This study should include 
the reworking of the numerical examples. Only by this procedure can a 
thorough grasp of the details of the various methods be obtained. 

The separation of the discussion of the methods of estimation of the 
population values, of the sampling errors, and of efficiency necessarily involves 
a good deal of cross-reference, particularly in the numerical examples. Since 
this appeared inevitable, it was with some hesitation that the chosen method 
of presentation was adopted. On balance, however, this disadvantage appeared 
to be outweighed by the advantage of being able to present as a whole the 
relatively simple techniques involved in estimation before the more complicated 
techniques required for the estimation of error and the assessment of relative 
efficiency. It is believed that this will make the book more useful to those 
who do not require to go deeply into these latter techniques. For those who 
prefer it, there is nothing to prevent the simultaneous study of the corresponding 
sections of Chapters 6 and 7, or indeed of Chapters 6, 7 and 8. Chapters 6, 
7 and 8 may also, if desired, be taken before Chapters 4 and 5. 

1.6 Terminology and notation 

The question of terminology was considered by the United Nations
Sub-Commission on Statistical Sampling, at its second session held in Geneva in
September, 1948. Their recommendations are included in a memorandum 
entitled Recommendations concerning the Preparation of Reports on Sampling 
Surveys. With a few minor exceptions the terminology adopted in this book 
is that recommended by the Sub-Commission. 

New conventions have been adopted for the mathematical notation. The 
use of bold face and Gill Sans type for population values and their estimates, 
and of capital letters for the population totals, has enabled the formulae to be 
presented in a very simple, and it is hoped easily understandable form. By 
the use of this notation the elaborate summation notation which has become 
current in much of the literature on sampling has been avoided. It is
recognized that the notation is not particularly convenient for manuscript 
and typescript, but the difficulty can in fact be overcome by the use of single 
and double underlining, or wavy and straight underlining. In verbal description 
the words "population" and "estimate," or abbreviations of them, can be used.



CHAPTER 2 

REQUIREMENTS OF A GOOD SAMPLE 

2.1 Bias

The principal object of any sampling procedure is to secure a sample 
which, subject to limitations of size, will reproduce the characteristics of the 
population, especially those of immediate interest, as closely as possible. 

At first sight it might appear that the most accurate results could be 
obtained by deliberate selection of the units to be included in the sample. 
In particular, if averages only are of interest, units might be selected which 
appear to be nearest to the average. If, for example, a quick assessment of 
the yield per acre of an agricultural crop is required, district officers might 
be asked to select some "average" fields in each district, and to determine
the yields of these fields. 

Such a sample is unfortunately very often of little value. Its primary 
fault is that it may well be biased, that is, the selection of all the fields may 
be affected by similar errors. Thus, in order to enhance the reputation of 
their districts, all district officers may tend to select fields which yield more 
heavily than the average, or, if they feel that the interests of the farmers or the 
country may be furthered by an underestimate, they may select fields which 
yield less than the average. 

Even if the district officers can be trusted to be completely objective, 
considerable unconscious errors of judgment, all tending in the same direction, 
may still occur, and such errors may far outweigh any increase in accuracy 
resulting from deliberate selection. Nor will increase in the number of officers 
concerned in the selection necessarily improve matters, since all may be subject 
to the same type of error. 

We may consequently distinguish between two types of sampling error, 
those arising from biases in selection, etc., and those due to chance differences 
between the members of the population included in the sample and those 
not included. The aggregate of the former in the sample will be termed the 
error due to bias and the aggregate of the latter the random sampling error, or 
when bias is known to be absent, the sampling error. The total sampling error 
will, of course, be made up of the bias, if any exists, and the random sampling 
error. The essence of bias is that it forms a constant component of error 
which does not decrease, in a large population, as the number in the sample 
increases, whereas the random sampling error decreases on the average as 
the number in the sample increases. 
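
The distinction can be illustrated by a small simulation. This is only a sketch under assumed conditions: the population is an artificial set of 2,000 values, and the biased procedure is represented, purely for illustration, by choosing units with probability proportional to their value, so that large units are favoured.

```python
import random
import statistics

def error_summary(population, n, trials, rng, favour_large):
    """Return (average error, spread of errors) of the sample mean.

    favour_large=True mimics a biased procedure: units are picked with
    probability proportional to their value. Otherwise a simple random
    sample is drawn without replacement.
    """
    true_mean = statistics.mean(population)
    errors = []
    for _ in range(trials):
        if favour_large:
            sample = rng.choices(population, weights=population, k=n)
        else:
            sample = rng.sample(population, n)
        errors.append(statistics.mean(sample) - true_mean)
    # The average error estimates the bias; the spread about that average
    # measures the random sampling error.
    return statistics.mean(errors), statistics.pstdev(errors)

rng = random.Random(1)  # fixed seed so that the sketch is repeatable
population = [rng.uniform(10, 30) for _ in range(2000)]  # artificial values
for n in (25, 400):
    for favour_large in (False, True):
        bias, spread = error_summary(population, n, 200, rng, favour_large)
        print(n, favour_large, round(bias, 2), round(spread, 2))
```

With the biased procedure the average error stays at much the same level when the sample is increased from 25 to 400 units, whereas the spread of the errors falls markedly under both procedures.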

2.2 Methods of selection which give rise to bias

There are a number of ways in which faulty selection of the sample may 
give rise to bias. The main causes may be broadly classified as follows:
(1) Deliberate selection of a "representative" sample. This is the type
    of bias described above.

(2) A procedure of selection depending on some characteristic which is
    correlated with properties of the unit which are of interest. Many
    haphazard selection processes give rise to biases of this kind.

(3) Conscious or unconscious bias in the selection of a "random" sample.
    If a proper random process is not strictly adhered to, the investigator,
    although claiming that his sample is random, may allow his desire
    to obtain a certain result to influence his selection. This type of bias
    is particularly serious, since its existence may not be immediately
    apparent.

(4) Substitution. Investigators often substitute another convenient member
    of the population when difficulties are encountered in obtaining
    information. Thus, in a house-to-house survey the next house
    may be taken when there is no reply. This will necessarily lead to
    a preponderance of houses of the type that are occupied all day,
    e.g. houses of people with families.

(5) Failure to cover the whole of the chosen sample. If no second visit
    is made to houses from which no reply is received there will still
    be bias even though no substitution is attempted. This fault is
    particularly prevalent in postal questionnaires, which are often very
    incompletely returned. Returns are clearly likely to be received
    from individuals who are specially interested in the objects of the
    survey, or possess other characteristics which make them
    unrepresentative of the whole population.
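
The bias arising under (4) and (5) can be illustrated by simple arithmetic. The sketch below assumes two types of house with different chances of a reply at a single visit; the function name, the proportions and the reply rates are all hypothetical.

```python
def share_among_respondents(population_share, reply_rate_group, reply_rate_others):
    """Proportion of a group among the houses from which a reply is obtained.

    population_share is the group's true proportion in the population;
    the two reply rates are the chances of a reply at a single visit.
    """
    replying_group = population_share * reply_rate_group
    replying_others = (1 - population_share) * reply_rate_others
    return replying_group / (replying_group + replying_others)

# Assumed figures: 40 per cent of houses are occupied all day and reply at
# 90 per cent of visits; the remainder reply at only 30 per cent of visits.
print(round(share_among_respondents(0.4, 0.9, 0.3), 3))
```

On these assumed figures houses occupied all day form about two-thirds of the replies, against 40 per cent of the population. The same proportion applies when the next house is substituted until a reply is obtained, since every house then included is one that was found occupied at the time of the visit.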

2.3 Avoidance of bias in selection 

It is clear that, if possibilities of bias exist, no fully objective conclusions 
can be drawn from a sample. The first essential of any sampling procedure 
must therefore be the elimination of all important sources of bias. 

The simplest, and the only universally certain way, of avoiding bias in
the selection process is for the sample to be drawn either entirely at random, 
or at random subject to restrictions which, while improving the accuracy, 
are of such a nature that they do not introduce bias into the results. In some 
cases, however, certain forms of systematic selection, such as the selection of 
names at equal intervals down a list, or the use of an evenly spaced grid of 
points on a map, may be permissible. 

Random selection does not mean haphazard selection. A random sample 
can only be obtained by adherence to some proper random process, such as 
the drawing of lots or the use of a table of random numbers. Sticking pins 
into a map will not give a random distribution of points in a map. The 
selection of houses by walking through the streets of a town will not give 
a random selection of houses in the town. The words "random" and
"random sample" are, in fact, gravely abused. For this reason, if for no
other, the method of selecting the sample should be specified in all accounts
of the results of sample surveys and censuses, and indeed in all sampling work.

In order to prevent careless or deliberately biased selection on the part
of investigators it is often important in large-scale work for the selection to
be done in some central office, in such a manner that no element of choice is
left to the investigators, and in such a manner also that checks on the field
work can be imposed if necessary. Even in cases in which a less rigorous 
method of selection may be judged to be satisfactory, it may be necessary 
to impose a rigorous method in order to prevent criticism on this ground by 
those not familiar with the details of the work. 



2.4 Examples of biased selection 

It may be well at this stage to give some actual examples of cases in which 
an unsatisfactory method of selection has introduced serious bias into the 
results. 

The first example is taken from a paper by Kiser (1934, D). A sample of 
households was taken in Syracuse, U.S.A., in 1930 and 1931, with the object 
of making a study of morbidity. It was also intended to use this sample for 
the study of birth-rates. Before beginning this latter study, which was 
subsidiary to the morbidity study, a comparison was made of the sizes of 
households of the sample with those of the corresponding census tracts. This 
comparison is shown in Table 2.4.a. (Households of one were not included
in the survey.)

TABLE 2.4.a  SAMPLE OF HOUSEHOLDS IN SYRACUSE: DISTRIBUTION OF HOUSEHOLDS
ACCORDING TO SIZE, IN THE ORIGINAL SAMPLE, AND IN THE CENSUS TRACTS

Number in        Original sample          Census tracts
household       Number   Per cent.      Number   Per cent.

2                  254      19.4         1,762      26.8
3                  338      25.9         1,745      26.5
4                  307      23.5         1,438      21.9
5                  201      15.4           853      13.0
6                  106       8.1           388       5.9
7                   46       3.5           208       3.2
8                   25       1.9            96       1.5
9 and over          29       2.2            86       1.3

TOTAL            1,306      99.9         6,576     100.1

It is immediately apparent from the table that the sample contains a 
considerably greater proportion of large households than exist in the whole 
population. Households of two are under-represented in the sample to the
extent of 7.4 per cent. of all households, or 28 per cent. of the households of
this size. This deficiency is attributed by Kiser to the failure of enumerators
to revisit missed households, in which childless married women working away 
from home are likely to predominate. In order to provide a more satisfactory 
sample it was necessary to make a further survey of those families that were 
missed altogether at the time of the morbidity survey. 
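
The two figures quoted above for households of two can be checked from the counts in Table 2.4.a. The following is a minimal sketch, not part of the original account; the variable names are illustrative, and the percentages are rounded to one decimal place before subtraction, as the text appears to do.

```python
# Counts for households of two, and totals, as printed in Table 2.4.a.
sample_two, sample_total = 254, 1306
census_two, census_total = 1762, 6576

# Percentages rounded to one decimal place, as in the table.
sample_pct = round(100 * sample_two / sample_total, 1)
census_pct = round(100 * census_two / census_total, 1)

# Shortfall expressed in per cent. of all households.
shortfall_points = census_pct - sample_pct
# The same shortfall expressed in per cent. of the households of this size.
shortfall_relative = 100 * shortfall_points / census_pct

print(f"sample {sample_pct} per cent., census tracts {census_pct} per cent.")
print(f"shortfall {shortfall_points:.1f} per cent. of all households")
print(f"shortfall {shortfall_relative:.0f} per cent. of households of two")
```

Working from the unrounded counts gives very slightly smaller values, so the agreement with the text depends on the rounding convention assumed here.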

It is interesting to note that the sample was apparently considered 
satisfactory for the morbidity study, as is indicated by the statement that the 
workers "had been primarily concerned with securing a sample representative
of the area in regard to prevalence of sickness rather than size of household." 
Actually such a biased sample can scarcely be regarded as wholly satisfactory 
for even a morbidity study, since sickness rates are likely to vary with the 
size and composition of the family. 

The second example is one obtained at Rothamsted in an experimental
sampling of a collection of stones (Yates, 1936, i, H). The stones, a number of
flints of varying sizes, some 1200 in all, were spread out on a table, and twelve
observers were each instructed to choose three samples of twenty stones which 
should represent as nearly as possible the size distribution of the whole 
collection. Table 2.4.b gives the mean weights per stone of these 36 samples, 
and also the true mean weight of the whole collection. 

TABLE 2.4.b  MEAN WEIGHT PER STONE IN SAMPLES OF 20 STONES (oz.)

Observer     1    2    3    4    5    6    7    8    9   10   11   12

Sample 1    1.9  2.4  2.4  1.9  2.2  2.8  2.4  1.6  2.2  2.6  2.4  2.4
Sample 2    1.8  3.0  2.4  2.0  2.7  2.6  2.6  2.0  2.2  2.2  2.4  3.0
Sample 3    1.7  2.4  2.1  2.0  3.1  2.8  2.6  2.0  2.2  3.1  1.8  2.4

Mean        1.8  2.6  2.3  2.0  2.7  2.7  2.5  1.9  2.2  2.6  2.2  2.6

Mean of all samples : 2.34 oz. True mean : 1.91 oz.

It is apparent that there is a tendency, which is common to most observers, 
to select stones which are on the average larger than those of the whole 
collection. Of the twelve observers ten chose samples whose mean weight 
was above the mean weight, 1.91 oz., of all the stones, the mean for all samples
being 2.34 oz. This tendency is consistent from sample to sample. Thus,
of the thirty samples chosen by the above ten observers, all but two had mean 
weights greater than the mean weight of all stones, while all three samples of 
observer 1 were less than the correct mean. 
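
The summary figures quoted for the stone samples can be reproduced from the observer means. The sketch below is illustrative only: it takes the observer means of Table 2.4.b, reading the mean for observer 2 as 2.6 oz. (the value implied by that observer's three samples of 2.4, 3.0 and 2.4 oz.), and the names used are not from the original paper.

```python
TRUE_MEAN_OZ = 1.91

# Mean weight per stone for each of the twelve observers (oz.).
observer_means = [1.8, 2.6, 2.3, 2.0, 2.7, 2.7, 2.5, 1.9, 2.2, 2.6, 2.2, 2.6]

# Number of observers whose samples were heavier on average than the collection.
observers_above = sum(mean > TRUE_MEAN_OZ for mean in observer_means)

# Each observer took the same number of stones, so the mean of all samples
# is the simple mean of the observer means.
overall_mean = sum(observer_means) / len(observer_means)
bias_oz = overall_mean - TRUE_MEAN_OZ

print(f"{observers_above} of {len(observer_means)} observers above the true mean")
print(f"mean of all samples {overall_mean:.2f} oz., bias {bias_oz:+.2f} oz.")
```

The positive bias of rather more than 0.4 oz. is over a fifth of the true mean weight.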

In this example the selection was deliberate. A further example showing
similar effects arising from haphazard selection (claimed by the observer to
be "random") is provided by some observations obtained in the course of
a scheme of sampling observations on the growth of wheat instituted by the
Agricultural Meteorological Committee (Yates, 1935, A).

In this scheme measurements on the heights of shoots of wheat were made 
at regular intervals on observation plots at a number of centres. A detailed 
procedure had been laid down for the random location on each occasion of 
128 quarter-metre lengths of row in sets of 4 on contiguous rows. The height
measurements were made on the 256 shoots at the ends of these lengths; test
observations conducted at another time indicated that this method of selection
was virtually random. At one centre a drill with fewer rows than normal
had to be used, and as a result only 192 shoots were available for measurement
on each occasion. In order to provide the number of observations laid down
and thereby, as he thought, improve his results, the observer selected "at
random" two additional shoots from each set of three quarter-metre
lengths. Fortunately he booked the observations on these additional shoots 
separately. 

Figures 2.4.a and 2.4.b show the distribution of the regular and additional 
measurements taken on the 31st May and on the 28th June respectively. The 
deviations from the set means of the regular measurements are shown. 
Suitable adjustments, details of which are given in the original paper, are 
made to the additional measurements to give fair representation of the 
variability as well as the bias in the mean. 

Examination of Figure 2.4.a indicates that on this date the additional
measurements show a considerable preponderance of positive deviations with
a corresponding deficiency of negative deviations. There is, in fact, a tendency
to select shoots which are higher on the average than those of a truly random
sample, the difference in the average height being +3.3 cm. This difference
is clearly in the nature of a bias, and cannot be attributed to random sampling 
errors. 

The situation was entirely different on the 28th June, as is shown in 
Figure 2.4.b. At this date the deviations of the additional measurements, 
both positive and negative, are smaller on the average than are those of the 
regular observations ; in other words, there is a tendency to select shoots 
which are nearer the mean height than they would be on the average in a truly 
random sample. In spite of this, there is again a considerable bias, this time 
negative, the mean difference being -2.7 cm. In this case, therefore, a single
additional shoot will give a value which on the average is closer to the true 
mean value than is the value given by a single randomly located shoot, but 
as the number of shoots is increased the relative accuracy of the random sample 
progressively increases, and with the numbers of shoots actually taken, the 
random sample is considerably more accurate. 
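
The argument of the last paragraph can be put numerically. The error of a mean of n shoots has a random part, which falls as n increases, and a bias, which does not. The sketch below uses the bias of -2.7 cm quoted above, but the two standard deviations are purely hypothetical values chosen for illustration; they are not taken from the original paper.

```python
import math

BIAS_ADDITIONAL_CM = -2.7   # mean difference on 28th June, from the text
SD_REGULAR_CM = 8.0         # hypothetical spread of randomly located shoots
SD_ADDITIONAL_CM = 5.0      # hypothetical (smaller) spread of selected shoots

def rms_error_random(n):
    """Root-mean-square error of the mean of n randomly located shoots."""
    return SD_REGULAR_CM / math.sqrt(n)

def rms_error_selected(n):
    """Root-mean-square error of the mean of n personally selected shoots."""
    return math.sqrt(BIAS_ADDITIONAL_CM ** 2 + SD_ADDITIONAL_CM ** 2 / n)

for n in (1, 4, 16, 64):
    print(n, round(rms_error_random(n), 2), round(rms_error_selected(n), 2))
```

With these assumed values a single selected shoot is the more accurate, but the random sample overtakes it after a handful of shoots, and the error of the selected shoots can never fall below the bias of 2.7 cm however many are taken.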

This example provides an illustration of a case where the biases on the
two occasions, though arising from similar defects in selection, are of very
different magnitude, and indeed of opposite sign. Consequently the difference
of the two sets of measurements will also be seriously affected by bias. In
this case the growth rate of the wheat would have been underestimated by
nearly 10 per cent. had only the additional measurements been available.

FIG. 2.4.a  DISTRIBUTION OF REGULAR OBSERVATIONS (shaded) AND ADDITIONAL
OBSERVATIONS (unshaded) OF HEIGHTS OF WHEAT SHOOTS ON 31st MAY
Horizontal axis: scale of deviations for regular observations (cm).
(By courtesy of the editor of the Annals of Eugenics.)

FIG. 2.4.b  DISTRIBUTION OF REGULAR OBSERVATIONS (shaded) AND ADDITIONAL
OBSERVATIONS (unshaded) OF HEIGHTS OF WHEAT SHOOTS ON 28th JUNE
Horizontal axis: scale of deviations for regular observations (cm).
(By courtesy of the editor of the Annals of Eugenics.)

These biases are, of course, of the type that might be expected. When the 
shoots are only half-grown and there is nothing much to be seen except the 
top leaves there will be a tendency to pick the longer shoots, but when the 
crop has come into ear the observer can see shoots of all lengths, and is more
likely to select shoots somewhere near the average, omitting both very long 
and very short shoots. The strong negative bias of the last set of measurements 
shows that this selection was not particularly effective in improving the 
accuracy of the sample. 



2.5 Bias arising from faulty demarcation of the sampling units 

Any consistent errors in measurement will clearly give rise to bias, 
whether the measurements are carried out on a sample or on all the units of 
the population. The danger of such errors is, however, likely to be greater 
in sampling work, since the units measured are often smaller. Furthermore, 
the knowledge that had another sampling unit by chance been selected a very 
different value might have been obtained, may lead the inexperienced worker 
to believe that accuracy in the measurement of the selected units is of little 
importance.

When the sampling units are not natural units of the population, the selected 
units usually have to be demarcated at the time the measurements are taken. 
In crop sampling work in particular, where small areas are selected in order 
to obtain an estimate of the yield or other characteristics of the crop, location 
of the areas by means of randomly selected co-ordinates, though theoretically 
ensuring a random sample, will only in practice do so if the field work is carried 
out with complete objectivity. Since it is impossible in practice to locate the 
areas according to their co-ordinates by means of exact measurements, pacing 
or some similar approximate method must be used. 

In this type of work the areas themselves should not be too small, both 
because errors in the demarcation of the boundaries become of increasing 
importance as the size of the unit is decreased, and also because the possibility 
of influencing the results by small changes in location, e.g. so as to include 
a particularly good plant, is greater the smaller the unit areas. Very small 
areas are capable of giving completely reliable results with experienced and 
well trained field-men, but may be very unreliable when used by inexperienced 
workers, particularly if the need for complete objectivity is not appreciated. 

Sukhatme (1946a, H), for example, has reported the biases shown in
Table 2.5 in some trial crop-sampling work on wheat. He himself expresses
the opinion that the biases of the very small areas are due to the inclusion of
border plants. This, however, would imply that the effective radius of the
smallest areas, which were nominally circles of 2 ft. radius, would have to be
increased by nearly 5 inches. Errors of this magnitude appear improbable,
unless the observers were very careless in their work, and it seems likely that
part at least of the bias has been caused by faulty location.
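
The figure of nearly 5 inches can be checked roughly as follows. The sketch assumes, for illustration only, that the whole of the 42.4 per cent. overestimation shown for the smallest areas in Table 2.5 arises from harvesting a circle larger than the nominal one over a crop of uniform density, so that the yield obtained is proportional to the area actually cut.

```python
import math

NOMINAL_RADIUS_FT = 2.0
OVERESTIMATION = 0.424   # 42.4 per cent., smallest areas in Table 2.5

# Area is proportional to the square of the radius, so the radius must be
# multiplied by the square root of the factor by which the yield is inflated.
effective_radius_ft = NOMINAL_RADIUS_FT * math.sqrt(1 + OVERESTIMATION)
extra_inches = (effective_radius_ft - NOMINAL_RADIUS_FT) * 12

print(f"effective radius {effective_radius_ft:.2f} ft.")
print(f"increase {extra_inches:.1f} inches")
```

The increase comes to rather more than 4.5 inches on a nominal radius of 24 inches, which agrees with the statement in the text.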

Eye estimates are themselves a form of measurement, but such estimates 
are always subject to bias, which is likely to vary from observer to observer, 
and is often very substantial. If eye estimates are used, steps must therefore 
be taken to eliminate the resultant biases by carrying out proper measurements 
on a sub-sample of the material. A simple example of this is provided by the 
1938-9 Census of Woodlands described in Section 4.25 and Examples 6.12.b 
and 7.11. A more complicated example is discussed in Sections 6.15 and 7.14. 



TABLE 2.5  BIAS IN THE USE OF SMALL-SIZE AREAS IN SAMPLE SURVEYS FOR YIELD
(Sukhatme)

Size of area     No. of     Average yield in     Percentage
in sq. ft.       areas      maunds per acre      overestimation

Irrigated
   471.5           78            10.10                 -
   117.9           78            10.58                4.8
    29.5           78            11.69               15.7
    28.3          117            11.60               14.9
    12.6          117            14.38               42.4

Unirrigated
   471.5          107             6.55                 -
   117.9          107             7.27               11.0
    29.5          107             8.08               23.4
    28.3          102             7.52               14.8
    12.6          161             9.33               42.4



2.6 Bias in estimation 

In addition to biases which arise from faulty processes of selection and 
faulty work during the collection of the information, faulty methods of 
analysing the results may also introduce bias. A simple example occurs in 
the estimation of ratios. If, for instance, an agricultural crop is grown on 
types of land with different levels of fertility, and if the fields on the different 
types of land are of different average size, the mean yield per acre estimated 
from the mean of the yields per acre of all the fields may be markedly different 
from the mean yield per acre of all the land growing the crop. To take a 
numerical example, if there are three types of land having average yields of 
20 cwt., 15 cwt. and 10 cwt. per acre respectively, and fields of an average 
size of 5 acres, 10 acres and 15 acres respectively, the number of fields on each 
type of land being the same, the mean yield per acre over the whole of the
land will be given by the weighted mean

$$\frac{5 \times 20 + 10 \times 15 + 15 \times 10}{5 + 10 + 15} = 13\tfrac{1}{3} \text{ cwt. per acre,}$$

whereas the mean of the yields per acre of all fields will be 15 cwt. Consequently
the bias in the estimate by the latter process will be about 12 per cent.
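
The arithmetic of this example is easily verified. The short sketch below uses the yields and field sizes given in the text; the variable names are illustrative.

```python
# Average yield (cwt. per acre) and average field size (acres) on the three
# types of land, with the same number of fields on each type.
yields_per_acre = [20, 15, 10]
field_sizes = [5, 10, 15]

# Correct estimate: total yield divided by total area (area-weighted mean).
weighted_mean = (
    sum(size * y for size, y in zip(field_sizes, yields_per_acre))
    / sum(field_sizes)
)

# Faulty estimate: simple mean of the yields per acre of the fields.
unweighted_mean = sum(yields_per_acre) / len(yields_per_acre)

bias_per_cent = 100 * (unweighted_mean - weighted_mean) / weighted_mean

print(f"weighted mean   {weighted_mean:.2f} cwt. per acre")
print(f"unweighted mean {unweighted_mean:.2f} cwt. per acre")
print(f"bias            {bias_per_cent:.1f} per cent.")
```

The bias is positive here because the smaller fields lie on the more fertile land; were the larger fields the more fertile, the simple mean would underestimate the yield.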

Biased estimates can be avoided by using the proper processes of estimation. 
This matter will be dealt with more fully in Chapter 6.*

2.7 Circumstances in which bias is permissible 

Although avoidance of any substantial bias is usually of the utmost 
importance, particularly in censuses on which administrative action has to be 
based, absence of bias is not always essential. In some types of investigation 
a certain amount of bias, provided it is reasonably constant, can be accepted. 
In censuses which are repeated at frequent intervals with a view to determining 
the changes rather than absolute values, for instance, a small overall bias may 
be of little consequence, provided it is constant in time. Similarly in surveys 
which have as their main objective the comparison of different groups of the 
population a bias which is approximately constant from group to group will 
be of little importance. The investigator must also avoid attaching exaggerated 
importance to minor sources of bias which, in fact, can only produce errors 
which are trivial relative to the random sampling error. 

2.8 Methods of reducing the random sampling error 

Once the absence of any important bias has been ensured, attention can be 
turned to the random sampling errors. These must clearly be sufficiently 
small to achieve the accuracy required. 

Apart from errors due to bias, the simplest way of increasing the accuracy 
of the sample is to increase its size. Other things being equal, the random 
sampling error is approximately inversely proportional to the square root of 
the number of units included in the sample. 
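
The practical consequence of the square-root rule is that halving the sampling error requires four times as many units. A minimal sketch, with hypothetical function names, and ignoring any correction for the proportion of the population sampled:

```python
import math

def relative_sampling_error(n, n_reference):
    """Sampling error with n units, relative to that with n_reference units."""
    return math.sqrt(n_reference / n)

def units_needed(n_reference, error_ratio):
    """Units needed to multiply the sampling error by error_ratio."""
    return math.ceil(n_reference / error_ratio ** 2)

# Quadrupling a sample of 100 units halves the error ...
print(relative_sampling_error(400, 100))
# ... and reducing the error to one-tenth requires 100 times as many units.
print(units_needed(100, 0.1))
```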

The accuracy attained will, however, depend not only on the number of 
units included in the sample, but also on the variability per unit ; or, more 
strictly, on that part of the variability per unit which contributes to the 
sampling error. It is here that the complications of sampling procedure, both 
of design and of subsequent analysis, arise. By suitable processes of selection, 
which while imposing restrictions on fully random selection do not introduce 
bias into the results, the part of the variability per unit which contributes to 
the sampling error can often be substantially reduced, and the size of the 
sample required for a given accuracy thereby diminished. 

* See also Sections 10.6 and 10.7.

The simplest type of restriction is that known as stratification. The
population is "stratified" or divided into blocks of units in such a manner
that the units in each stratum or block are as similar as possible. Each of
the strata is then sampled at random. If the same proportion is taken from
each stratum, it is clear that each stratum will be represented in the correct
proportion in the sample, and consequently differences between different strata
are eliminated from the sampling error.
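
The gain from stratification can be seen in a small simulation. The population below is entirely artificial: three strata of 100 units each, differing widely in mean but not in internal variability, all values being assumptions made for illustration. Samples of 30 are drawn repeatedly, either at random from the whole population or as 10 from each stratum, and the variances of the resulting sample means are compared.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so that the run is repeatable

# Artificial population: three equal strata with different means.
strata = [[rng.gauss(mu, 2.0) for _ in range(100)] for mu in (10.0, 20.0, 30.0)]
population = [value for stratum in strata for value in stratum]

def random_sample_mean(n):
    """Mean of a random sample of n units from the whole population."""
    return statistics.fmean(rng.sample(population, n))

def stratified_sample_mean(n_per_stratum):
    """Mean of a sample taking the same number of units from each stratum."""
    chosen = []
    for stratum in strata:
        chosen.extend(rng.sample(stratum, n_per_stratum))
    return statistics.fmean(chosen)

replications = 2000
random_means = [random_sample_mean(30) for _ in range(replications)]
stratified_means = [stratified_sample_mean(10) for _ in range(replications)]

var_random = statistics.variance(random_means)
var_stratified = statistics.variance(stratified_means)
print(f"variance of mean, random sample:     {var_random:.3f}")
print(f"variance of mean, stratified sample: {var_stratified:.3f}")
```

Because the strata are of equal size and equally sampled, the simple mean of the stratified sample is correctly weighted; only the variability within strata then contributes to its error.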

In addition to stratification there are a number of other devices, which 
will be discussed in more detail later, by which the accuracy of the sampling 
procedure can be increased, often very substantially. The three most important 
are : utilization of supplementary information, use of a variable sampling 
fraction (sometimes called "optimal allocation") and multi-stage sampling.

Utilization of supplementary information, that is information which is derived 
from sources other than the sampling scheme, or from a more extended sample 
than that on which information on the main characters is collected, takes a 
number of forms. A simple example will illustrate the general principle. 
Suppose that an estimate of the wheat yield of a country is required, and that 
a random sample of wheat fields has been taken and the total yield of each 
field determined. We can then estimate the total wheat yield of the country, 
either (a) by multiplying the total yield of the sample by the reciprocal of the
proportion of the fields included in the sample, or (b) by calculating the mean
yield per acre of the sampled fields (by dividing the total yield of all the sampled
fields by their total area, so as to avoid bias) and multiplying this mean yield
by the total acreage of wheat in the country. The latter estimate can only be
made if the total acreage of wheat in the country is already known with sufficient
accuracy, e.g. from returns made by the farmers or from a larger sample.
If this information is available the second estimate is likely to be considerably 
more precise than is the first, since the variability of the total yields, which 
in so far as the yield per acre is constant will be proportional to the areas of 
the individual fields, is likely to be considerably greater than the variability 
of the yields per acre of the individual fields. 
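
The two estimates (a) and (b) may be written out as follows. The sample of four fields, the number of fields in the country and the total acreage are hypothetical figures introduced only to show the calculation, and the function names are likewise illustrative.

```python
def estimate_from_field_count(sample_yields, fields_in_population):
    """Estimate (a): sample total multiplied by the reciprocal of the
    proportion of fields sampled."""
    return sum(sample_yields) * fields_in_population / len(sample_yields)

def estimate_from_acreage(sample_yields, sample_areas, total_acreage):
    """Estimate (b): mean yield per acre of the sampled fields (total yield
    over total area) multiplied by the known total acreage."""
    return sum(sample_yields) / sum(sample_areas) * total_acreage

# Hypothetical sample: areas in acres and total yields in cwt.
areas = [4, 10, 25, 7]
yields = [60, 160, 350, 112]

estimate_a = estimate_from_field_count(yields, fields_in_population=1000)
estimate_b = estimate_from_acreage(yields, areas, total_acreage=12000)
print(f"(a) {estimate_a:.0f} cwt.   (b) {estimate_b:.0f} cwt.")
```

In this sample the yields per acre lie between 14 and 16 cwt., whereas the total yields per field vary nearly sixfold; it is this difference in variability which makes estimate (b) the more precise when the acreage is known.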

The use of a variable sampling fraction, i.e. the inclusion of different 
proportions of the different strata in the sample, enables the more important, 
or more variable, parts of the population to be sampled more intensively. 
If this is done it will of course be necessary to weight the contributions of the 
different strata to the total in the correct proportions. 

The optimal sampling fractions depend on the relative variability of the 
different strata into which the population is divided for the purpose of taking 
a sample. Thus, if it is required to determine the number of workers in a 
given industry, it will be better to take a much larger fraction, possibly all, 
of the large factories than of the smaller factories. 
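
The text does not give a formula at this point. The sketch below uses the rule commonly given in textbooks (Neyman allocation), by which the number of units taken from a stratum is made proportional to the number of units in the stratum multiplied by their standard deviation, any stratum which would thereby be oversubscribed being taken in full. The three strata of factories and their standard deviations are hypothetical.

```python
def allocate(stratum_sizes, stratum_sds, n):
    """Allocate a sample of n units in proportion to size x standard
    deviation, sampling completely any stratum that would be oversubscribed."""
    allocation = [0.0] * len(stratum_sizes)
    free = set(range(len(stratum_sizes)))
    remaining = n
    while free:
        weight = sum(stratum_sizes[h] * stratum_sds[h] for h in free)
        full = [
            h for h in free
            if remaining * stratum_sizes[h] * stratum_sds[h] / weight
            >= stratum_sizes[h]
        ]
        if not full:
            for h in free:
                allocation[h] = (
                    remaining * stratum_sizes[h] * stratum_sds[h] / weight
                )
            break
        for h in full:
            allocation[h] = float(stratum_sizes[h])
            remaining -= stratum_sizes[h]
            free.discard(h)
    return allocation

# Hypothetical industry: small, medium and large factories, with the
# standard deviation of the number of workers per factory in each stratum.
sizes = [1000, 200, 20]
sds = [5.0, 50.0, 500.0]
allocation = allocate(sizes, sds, n=100)
fractions = [a / size for a, size in zip(allocation, sizes)]
print([round(a, 1) for a in allocation])
print([round(f, 3) for f in fractions])
```

With these figures all 20 large factories are taken, about a quarter of the medium factories, and under 3 per cent. of the small ones. The allocations are left unrounded; in practice they would be rounded to whole numbers of units.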

In multi-stage sampling the population is divided into a number of first-stage 
sampling units, which are sampled in the ordinary manner, the selected 
first-stage units being subdivided into smaller second-stage units, which are 
also sampled. Further stages can be added if required. Thus, for example, 
in a population survey, a sample of all towns and villages may be taken, and 
in each of the selected towns and villages a sub-sample of all households may 
be taken, with, possibly, for certain purposes, a further sub-sample of individuals
from the selected households. 
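
The structure of such a sample may be sketched as below. The frame of towns, households and individuals is invented for the purpose, and the numbers taken at each stage are arbitrary; the point is only that each stage is a random selection made within the units selected at the stage before.

```python
import random

rng = random.Random(0)  # fixed seed so that the run is repeatable

# Invented frame: 10 towns, each of 20 households of 1 to 6 persons.
frame = {
    f"town_{t}": {
        f"household_{t}_{h}": [
            f"person_{t}_{h}_{p}" for p in range(rng.randint(1, 6))
        ]
        for h in range(20)
    }
    for t in range(10)
}

def multi_stage_sample(frame, n_towns, n_households, rng):
    """Select towns, then households within the selected towns, then one
    individual within each selected household."""
    sample = {}
    for town in rng.sample(sorted(frame), n_towns):
        households = frame[town]
        sample[town] = {
            household: rng.choice(households[household])
            for household in rng.sample(sorted(households), n_households)
        }
    return sample

sample = multi_stage_sample(frame, n_towns=3, n_households=5, rng=rng)
for town, chosen in sample.items():
    print(town, sorted(chosen))
```

Only the households of the three selected towns need be listed, and only the members of the fifteen selected households need be enumerated.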

2.9 Choice of unit 

In some classes of material there is considerable choice in the type and size 
of sampling units, and this gives further scope for increase in the efficiency 
of the sampling procedure. In general, when a given proportion of the material 
is included in the sample, the smaller the sampling units employed, the more 
accurate and representative will be the results. Thus, for example, in an
agricultural survey, it will be more accurate to take 10 per cent. of all farms in
each parish, or other small administrative unit, than to take all the farms in
10 per cent. of the parishes. This will remain true even if multi-stage
sampling is adopted. It will be more accurate, for example, to take 10 per cent.
of all the parishes in each county, with a second-stage sampling of the farms
of each selected parish, rather than to take all the parishes in 10 per cent. of
the counties, with the same degree of second-stage sampling. The reason
for this is fairly obvious. The parishes in any county are likely to be more alike 
than are those of different counties, and if counties are used as sampling units, 
all the parishes in a county will be included or excluded from the sample 
simultaneously. 

This need for small units distributed over the whole of the population 
often conflicts with the administrative requirements. It is clearly easier to 
arrange for a survey of farms in compact areas, such as parishes or counties, 
than to have to survey the same number of farms scattered over the whole 
country. The choice of a suitable balance between these two conflicting 
requirements is often one of the main problems in the planning of a sample 
survey. Furthermore, if only a small number of large units are included in 
the sample, whether or not there is second-stage sampling of these units, the 
sampling error will not be well-determined, since there will be relatively few
differences between units on which to base the estimate of this. 

We see, therefore, that the choice of sampling method depends not only 
on the relative accuracy of the different methods, but also on their practical 
convenience. It is important, for example, that the process of selection of the 
sample should not involve excessive preliminary work in the form of mapping, 
etc. The most suitable sampling method will therefore depend very much 
on the type of information that is already available on the population to be 
sampled ; a method which may be excellent for a country where good maps 
are available may be entirely useless in a country which is inadequately mapped. 
Again, it is important that not only should the collection of the information 
not involve excessive travelling, but also it should be possible to subject the 
field-workers to proper supervision : consequently, sampling procedures which 
may be excellent with postal questionnaires may be entirely unsatisfactory 
when the information is collected by special investigators. 
CHAPTER 3
THE STRUCTURE OF VARIOUS TYPES OF SAMPLE 

3.1 Definition of frame and sampling unit 

In this chapter we propose to give a technical description of the structure 
of the various types of sample which are most commonly employed in practice, 
and the methods which must be followed in selecting them. The methods of
obtaining estimates of the population values and of the sampling errors from
the sample values will be discussed in Chapters 6 and 7.

All rigorous sampling demands a subdivision of the material to be sampled 
into units, termed sampling units, which form the basis of the actual sampling 
procedure. These units may be natural units of the material, such as individuals 
in a human population, or natural aggregates of such units, such as households, 
or they may be artificial units, such as rectangular areas on a map, bearing no 
relation to the natural subdivisions of the material. 

It is not always necessary to make an actual subdivision of the whole of 
the material before selection of the sample, provided the selected units can be 
clearly and unambiguously defined. Thus, with sampling units which are 
rectangular areas on a map there is no need to demarcate all these areas ; they 
can be defined by co-ordinates, and the selected areas demarcated after selection. 

Clear and unambiguous definition demands the existence or construction 
of some form of frame. In the sampling of a human population, for instance, 
with households as sampling units, there must be available a list of all
households, and this list must be such that any household selected from it can be
unambiguously located. In area sampling from maps, the maps must be such 
that the selected areas can be unambiguously defined on the ground. 

The specification of the frame implicitly defines the geographical scope of the 
survey and the categories of material covered. A survey of a human population 
based on a list of households, for instance, will only cover those categories 
of the population which constitute the households included in the list. If
other categories require inclusion, or if the frame is defective, special steps 
will have to be taken to supplement and emend it. 

In statistical terminology any aggregate of values is termed a population, 
and consequently the whole aggregate of sampling units into which the material 
is divided is known as the population of sampling units. If the sampling units 
are aggregates of the natural units of the material, these natural units will form 
a further population which must be distinguished from the population of 
sampling units. 

In America the term cluster sampling has been applied to sampling in which 
the sampling units are aggregates or "clusters" of the natural units. The
term is a somewhat loose one, since there is often a hierarchy of natural units,
e.g. a sample in which the sampling units are households may be regarded as
an ordinary sample of households or as a cluster sample of individuals. 
In multi-stage sampling there is also a hierarchy of sampling units,
first-stage, second-stage, etc., corresponding to the different sampling stages, and
each set of units will form its own population of units. 

Sampling units may be of the same or differing size. They may contain 
the same, or approximately the same, number of natural units, or they may 
contain widely differing numbers. The whole procedure of sampling, including 
the estimation of the population values and the sampling errors, is simplest 
when the sampling units are of approximately the same size and contain 
approximately the same number of natural units. Often, however, the material 
is such that this condition cannot be conveniently fulfilled. In particular, if 
the natural units are themselves of widely differing size, variation in size of 
the sampling units or in the number of natural units they contain is inevitable. 

There is nothing in the sampling process which demands that the sampling 
units should be of any particular size, but, as has been explained in Section 2.9, 
the smaller the sampling units employed the more accurate will be the 
results obtained when a given proportion of the material is included in the 
sample. 

3.2 Random sample 

A random sample is the simplest type of rigorously selected sample, and 
is the basis of most of the more complicated sampling methods. In a random 
sample, after subdivision of the material into sampling units, the requisite 
number of units are selected at random from the whole population of units. 

As has been emphasized in Section 2.3, random selection implies a strict 
process of selection equivalent to that of drawing lots. In practice it may 
be carried out either by some such process, or preferably, since adequate 
shuffling of cards, etc., is difficult, by the use of a table of random numbers. 
A small table of random numbers is given at the end of the book. The examples 
of this section illustrate the use of such a table. 

The process of random selection may proceed in two stages. Suppose
that the population is divided into groups of units containing
$x_1, x_2, x_3, \ldots, x_n$ units. The successive sub-totals

$$x_1 = X_1, \quad x_1 + x_2 = X_2, \quad x_1 + x_2 + x_3 = X_3, \quad \ldots, \quad x_1 + x_2 + \cdots + x_n = X_n$$

are first calculated, which is easily done on a printing adding machine. The
requisite number of numbers are then selected at random between 1 and $X_n$,
numbers that occur more than once being rejected. A selected number that
is greater than $X_{s-1}$, but less than or equal to $X_s$, indicates that a unit of the
$s$th group is to be taken. Selection of a unit at random from this group, which
can if convenient be made on the basis of the number already selected, will
then give the equivalent of completely random selection.
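
The two-stage process just described can be sketched as follows. The function names and the group sizes used in the illustration are hypothetical, and Python's pseudo-random generator stands in for a table of random numbers. The unit within the group is here taken on the basis of the number already selected, namely as the excess of that number over the preceding sub-total.

```python
import bisect
import itertools
import random

def locate(subtotals, number):
    """Return (group, unit within group), both counted from 1, for a number
    between 1 and the last sub-total."""
    s = bisect.bisect_left(subtotals, number)   # first sub-total >= number
    previous = subtotals[s - 1] if s > 0 else 0
    return s + 1, number - previous

def two_stage_select(group_sizes, n, rng):
    """Select n distinct units, given only the number of units per group."""
    subtotals = list(itertools.accumulate(group_sizes))
    # rng.sample draws without repetition, which is equivalent to rejecting
    # numbers that occur more than once.
    numbers = rng.sample(range(1, subtotals[-1] + 1), n)
    return [locate(subtotals, number) for number in numbers]

group_sizes = [3, 5, 2]              # hypothetical groups of 3, 5 and 2 units
print(two_stage_select(group_sizes, 4, random.Random(7)))
```

With sub-totals 3, 8 and 10, the number 4 indicates the first unit of the second group, and the number 8 the fifth and last unit of that group.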

This two-stage process is of value when the full numbering or demarcation 
of the units in all groups before sampling is laborious, since only the total 
number of units in each group need be known. It is of particular value when 
the units are artificially demarcated areas, and the total areas of natural 
subdivisions of the material are known. By using the process only the units
in the selected groups have to be numbered or demarcated. 

Example 3.2.a

Select a sample of 20 from a population of 2879 units. 

Using the four-figure numbers given by the first four columns of digits 
in Table A, 1, and rejecting all numbers greater than 2879, we obtain the 
sample 347, 1676, 1256, 1622, 1818, 2662, 2342, 1608, 2742, 39, 1690, 1127, 
1490, 2046, 526, 797, 2699, 1465, 2467, 1753. 

The above procedure results in the rejection, in this example, of nearly 
three-quarters of the random numbers given by the table. Various devices 
may be used to avoid this. In the present example the simplest is to take 
the numbers 3001-6000 and 6001-9000 as equivalent to 0001-3000, rejecting 
the numbers 9001-9999 and 0000. Using the second column of four-figure
numbers gives the sample 1373, 2467, 227, 2599, 2635, 1794, 1753, 378, 1234, 
2632, 792, 897, 1064, 2819, 1712, 1837, 2722, 1504, 13, 2565. 

If with either of the above procedures the same unit is selected a second 
time, the number leading to this selection is rejected, and an additional 
number taken. 
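
The second procedure of this example can be expressed as a simple rule for converting any four-figure random number into a unit number or a rejection. In the sketch below the function names are illustrative, and Python's pseudo-random generator is used in place of the table of random numbers, which is not reproduced here, so the sample obtained will not be that given above.

```python
import random

def to_unit(number, population=2879, block=3000):
    """Convert a four-figure number (0000-9999) to a unit number, or None if
    the number is to be rejected."""
    usable = (9999 // block) * block      # 9000: last number of the last block
    if number == 0 or number > usable:
        return None                       # 0000 and 9001-9999 are rejected
    unit = (number - 1) % block + 1       # 3001-6000, 6001-9000 -> 0001-3000
    return unit if unit <= population else None

def draw_sample(n, rng, population=2879, block=3000):
    """Draw n distinct units, rejecting repeats and taking further numbers."""
    chosen = []
    while len(chosen) < n:
        unit = to_unit(rng.randrange(10000), population, block)
        if unit is not None and unit not in chosen:
            chosen.append(unit)
    return chosen

print(to_unit(1373), to_unit(4373), to_unit(7373), to_unit(9500))
print(draw_sample(20, random.Random(1950)))
```

By this device 3 x 2879 = 8637 of the 10,000 four-figure numbers lead to a selection, against 2879 by the first procedure, while every unit retains the same chance of selection.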

It will be noted that neither of these samples is evenly distributed over 
the whole range of units. The distribution between the different thirds of 
the range is in fact:

Numbers        1st sample   2nd sample

1-960               4            5
961-1920           10            8
1921-2879           6            7

                   20           20

Random selection will give samples that deviate somewhat from an even 
distribution, the actual deviations being themselves governed by statistical 
laws. Exact statistical tests show that about three out of four samples will have 
smaller aggregate deviations than the first sample, but only three out of ten 
will have smaller aggregate deviations than the second sample.* 
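
The two proportions quoted can be checked approximately with the test mentioned in the footnote. The sketch below is an assumption about how the check might be programmed, not the computation used in the text: it relies on the large-sample form of the test with 2 degrees of freedom, for which the probability of a smaller value has the closed form 1 - exp(-x/2).

```python
import math

def chi_square(observed, expected):
    """Pearson's chi-square statistic for observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def fraction_with_smaller_deviation(chi2):
    """P(chi-square with 2 degrees of freedom < chi2), using 1 - exp(-x/2)."""
    return 1.0 - math.exp(-chi2 / 2.0)

sizes = [960, 960, 959]                       # units in each third of the range
expected = [20 * s / 2879 for s in sizes]     # expected numbers in a sample of 20
first = chi_square([4, 10, 6], expected)      # first sample
second = chi_square([5, 8, 7], expected)      # second sample
p_first = fraction_with_smaller_deviation(first)     # close to three out of four
p_second = fraction_with_smaller_deviation(second)   # close to three out of ten
```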

Example 3.2.b 

Select unit areas 1/10 mile x 1/10 mile at random from a rectangular area
5 miles x 4 miles. 

There are 2000 unit areas, which can best be defined by co-ordinates 1-50 
along the longer side of the rectangle, and 1-40 along the shorter side, the 

* The appropriate test is that known as the χ² test. A description of this test
will be found in most modern statistical textbooks.


co-ordinates selected defining the corner of the unit area furthest from the 
corner of the rectangle (0, 0). The selection of a number at random between
1 and 50, and a second between 1 and 40, will therefore select a unit at random.
Taking the third column of four-figure numbers (beginning 8636) and following
the second of the procedures of Example 3.2.a gives the pairs of co-ordinates
36, 36; 12, 02; 16, 16; 14, 38; etc.

If points instead of unit areas are to be selected each co-ordinate range 
should theoretically be infinitely subdivided. The actual degree of subdivision 
need not usually be very fine. 

The procedure of this example may be used for the selection of unit areas 
or points from an irregularly shaped area, provided the extreme range of 
each co-ordinate is included, points falling outside the area being rejected. 
More elaborate processes, involving less rejection, can of course be devised, 
but care must be taken that the probability of selection of all areas or points 
is equal. Thus in a triangular area, the selection of lines parallel to the base 
at random distances from the base, followed by the selection at random of 
a point within the triangle on each of the selected lines, will give a greater 
density of points near the apex of the triangle. The selection of points within 
a circle by the selection of random distances and bearings from the centre 
will give a greater density of points near the centre. In irregularly shaped 
areas, also, fractional unit areas requiring special treatment will occur at the 
boundaries. 
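
The difference between the two ways of selecting points within a circle can be seen by simulation. The following is a sketch under stated assumptions (a circle of unit radius, 20,000 points, and function names chosen for illustration). It compares rejection from the enclosing square with selection by random distance and bearing, by counting the share of points falling within half the radius, a region which contains one quarter of the area.

```python
import math
import random

def point_in_circle_by_rejection(rng, radius=1.0):
    """Pick (x, y) at random in the enclosing square; reject points outside the circle."""
    while True:
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            return x, y

def point_by_distance_and_bearing(rng, radius=1.0):
    """Random distance and bearing from the centre: NOT of equal density over the area."""
    r = rng.uniform(0.0, radius)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return r * math.cos(theta), r * math.sin(theta)

def share_in_inner_half(points, radius=1.0):
    """Fraction of points lying within half the radius (one quarter of the area)."""
    inner = sum(1 for x, y in points if math.hypot(x, y) <= radius / 2.0)
    return inner / len(points)

rng = random.Random(1)                        # fixed seed so the run is repeatable
uniform_pts = [point_in_circle_by_rejection(rng) for _ in range(20000)]
polar_pts = [point_by_distance_and_bearing(rng) for _ in range(20000)]
```

With rejection about one quarter of the points fall in the inner region, as they should; with random distances and bearings about one half do so.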



Example 3.2.c

14 streets in a ward contain 25, 17, 5, 59, 64, 22, 38, 16, 21, 12, 14, 38, 
17, 23 houses respectively. Make a random selection of 6 houses from all 
371 houses. 

The successive sub-totals are 25, 42, 47, 106, 170, 192, 230, 246, 267, 279, 
293, 331, 348, 371. A table of random numbers gives the numbers 72, 128, 
96, 326, 199, 202. The units 72 and 96 therefore fall in the 4th street, the
unit 128 in the 5th street, the units 199 and 202 in the 7th street, and the unit
326 in the 12th street. Since 72 - 47 = 25, and 96 - 47 = 49, the 25th
and 49th houses in the 4th street are selected, etc. The numbering of four
streets, involving 199 houses, is required.
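
The arithmetic of this example may be sketched in Python as follows. The helper `locate` is a hypothetical name, and the use of a binary search on the sub-totals is merely one convenient way of finding the street.

```python
from bisect import bisect_left
from itertools import accumulate

houses = [25, 17, 5, 59, 64, 22, 38, 16, 21, 12, 14, 38, 17, 23]
subtotals = list(accumulate(houses))          # 25, 42, 47, 106, ..., 371

def locate(number):
    """Return (street, house within street), both counted from 1."""
    s = bisect_left(subtotals, number)        # index of the first sub-total >= number
    before = subtotals[s - 1] if s > 0 else 0
    return s + 1, number - before

selected = [locate(n) for n in [72, 128, 96, 326, 199, 202]]
streets_to_number = sorted({street for street, _ in selected})
houses_to_number = sum(houses[s - 1] for s in streets_to_number)
```

Only the streets in `streets_to_number` need have their houses numbered, which is the saving in labour referred to above.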

3.3 Stratification with uniform sampling fraction 

In a stratified sample the population of sampling units is subdivided into 
groups or "strata" before selection of the sample. These strata may all
contain the same number of units, or differing numbers of units. If a uniform 
sampling fraction is used, the same fraction of the units of each stratum is 
included in the sample, the units selected being chosen at random from all 
the units within each stratum. A stratified sample is thus equivalent to a set 
of random samples on a number of sub-populations, each equivalent to one
stratum. 

Stratification has two purposes. The first is to increase the accuracy of
the overall population estimates. The second is to ensure that subdivisions 
of the population which are themselves of interest are adequately represented. 
Such subdivisions may be termed domains of study. Maximum overall accuracy 
will be attained if the strata are so chosen that the units within each stratum 
are as similar as possible. It will often be advisable to use domains of study 
as strata, however, even if some other form of stratification might be expected 
to give somewhat more accurate results. If there is marked heterogeneity 
within some or all of the domains of study these may themselves be subdivided 
into smaller strata for the purposes of sampling. 

Stratification affects the estimation of the sampling error. Since in a 
stratified sample only variation within strata gives rise to sampling error, it 
is this component of variation that requires estimation, and this can in general 
only be done from differences between units in the same stratum. It is 
therefore necessary, if an estimate of sampling error is required, that the strata 
be of such size that the sample contains two or more units from at least the 
majority of strata. In certain cases, in which the use of strata containing only 
a single selected unit appears advisable on account of the gain in accuracy 
thereby obtained, special methods, often of an approximate nature, of estimating 
the sampling error have to be adopted. The matter is discussed further in 
Section 8.15. 

If the sampling units are already classified in the required strata, the 
selection of a stratified sample can be made in the same way as a random sample, 
the requisite number of units being selected at random from each stratum. 
If, however, the population is not so classified, selection by this method would 
necessitate prior classification. In this case, if the numbers of units in the 
different strata are known, an alternative procedure is available. This consists 
of selecting a sample at random ; keeping a tally, as the selection proceeds, 
of the numbers falling in each stratum ; and rejecting any further members 
of a particular stratum as soon as the requisite number for that stratum has 
been obtained. On the other hand, if the numbers of units in the different 
strata are not known, a count covering the whole population will in any case 
have to be made, in which case a classification which will serve as a basis for 
the subsequent selection of the sample may well be carried out simultaneously. 
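
The alternative procedure of keeping a tally can be sketched as follows. The function name, the strata and the quotas are all illustrative assumptions. The point of the sketch is that the stratum of a unit is looked up only when that unit is drawn.

```python
import random

def stratified_by_tally(units, stratum_of, quotas, rng):
    """Draw units in random order, keeping a unit only while its stratum's quota is unfilled.

    units: list of unit identifiers; stratum_of: function giving a unit's stratum;
    quotas: dict of stratum -> requisite number of units.
    """
    tally = {s: 0 for s in quotas}
    sample = []
    order = list(units)
    rng.shuffle(order)                   # a random order stands for successive random draws
    for u in order:
        s = stratum_of(u)                # classification is needed only for units drawn
        if tally[s] < quotas[s]:
            sample.append(u)
            tally[s] += 1
        if tally == quotas:              # every stratum has its requisite number
            break
    return sample

# Hypothetical population: units 1-150 form one stratum, units 151-200 another.
def size_class(u):
    return "small" if u <= 150 else "large"

quotas = {"small": 15, "large": 5}       # a uniform sampling fraction of 1 in 10
sample = stratified_by_tally(list(range(1, 201)), size_class, quotas, random.Random(7))
```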

Unless all the strata contain the same number of units it will usually happen 
that the chosen sampling fraction will not give an exact whole number of units
in each stratum. In this case the nearest whole number of units has to be
taken. We may thus differentiate between the working sampling fraction,
which with stratification with a uniform sampling fraction is the same for all
strata, and the exact sampling fractions, which will differ slightly from the 
working sampling fraction. The use of the working sampling fraction in the 
analysis of the results leads to minor inaccuracies, but these will seldom give 
rise to errors of any practical importance. 

It may be noted that if the numbers of units from the whole population
falling in different strata are known, and a random sample is taken which is 
sufficiently large to ensure that adequate numbers of units are obtained from 
all strata, adjustment of the results so that the different strata are represented 
in their correct proportions will lead to practically the same accuracy as would 
be obtained with a stratified sample. Which of these two alternative courses 
is adopted in any particular case is a question of convenience. If the selection 
of either type of sample is equally simple it is best to use a stratified sample,
as the computations are thereby simplified. In certain cases, however, the 
classification of units into strata may only be possible by means of information 
obtained in the course of the survey, in which case a random sample, with 
subsequent adjustment, is required. Thus, for example, in a survey of a human 
population, the age distribution of the whole population may be known, but 
prior selection of individuals of particular ages may be impossible owing to 
lack of information on these ages. 
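
One simple form of such adjustment, for the estimation of a population total, is to weight the mean of the sample units falling in each stratum by the known number of units of the population in that stratum. The sketch below illustrates this under that assumption. The function name and the figures are hypothetical, and the estimation procedures actually recommended are those set out in the later chapters.

```python
def adjusted_total(sample, population_counts):
    """Estimate a population total from a random sample stratified after selection.

    sample: list of (stratum, value) pairs;
    population_counts: stratum -> number of units in the whole population.
    """
    sums, counts = {}, {}
    for stratum, value in sample:
        sums[stratum] = sums.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1
    missing = set(population_counts) - set(counts)
    if missing:                          # the sample must reach every stratum
        raise ValueError(f"no sample units in strata: {sorted(missing)}")
    # Each stratum mean is multiplied by the known population count for that stratum.
    return sum(population_counts[s] * sums[s] / counts[s] for s in population_counts)

# Hypothetical figures: two age-groups of known size, three individuals sampled.
estimate = adjusted_total(
    [("young", 2.0), ("young", 4.0), ("old", 10.0)],
    {"young": 100, "old": 50},
)
```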

3.4 Multiple stratification 

A population may be stratified for two or more different characteristics. 
If selection is made from sub-strata formed of the various combinations of the 
main classifications the procedure is exactly equivalent to ordinary stratification, 
the sub-strata being equivalent to strata. Thus we may stratify farms 
according to size and according to geographical regions. If the farms in each 
region are classified into size-groups before taking the sample then the region 
size-group combinations form the individual sub-strata. 

Occasionally the number of units of the population falling in each set of 
main strata may be known, e.g. from prior census data, but not the numbers in 
the various sub-strata. Thus, in the above example there may be information 
on the numbers of farms in the different size-groups, and also on the numbers 
in the different geographical regions, but not on the numbers of each size-group 
in each region. In such cases we may attempt the selection of a sample which 
will have the right proportions for each set of main strata. Such stratification 
may be termed multiple stratification without control of sub-strata. The selection 
of such a sample, however, presents both theoretical and practical difficulties, 
and the calculation of the sampling error is also troublesome. 

In the rare cases in which multiple stratification without control of sub-strata 
is deemed to be necessary a simple procedure of selection which should give 
a reasonably satisfactory sample is as follows. Units are selected at random 
until the total of every row and column of the two or more-way table for the 
sets of strata is at least equal to the required total. The excesses of these 
marginal totals are calculated, and numbers chosen for deduction from the 
sub-strata totals which together make up these excesses, and which, subject 
to these restrictions, are about proportional to the sub-strata totals. (A method 
of calculating such numbers is shown in the following example.) The 
corresponding numbers of units are then rejected from the sub-strata groups,
those rejected being selected at random. If the original selection was strictly
random the condition of randomness will be fulfilled if those last selected 
are rejected. 

Example 3.4 

A sample of 1000 is required from a population classified into two sets of 
four strata, the sub-strata totals being unknown, but the correct strata totals 
for the sample being known to be 120, 280, 350, 250, for each set of strata. 

After a sample of 1125 units had been drawn, the numbers of units in the
16 sub-strata shown in Table 3.4.a were obtained.

TABLE 3.4.a TWO-WAY STRATIFICATION WITHOUT CONTROL OF SUB-STRATA: INITIAL SAMPLE

                      Strata A
Strata B       1     2     3     4    Total   Required   Excess

1             37    40    35     8     120       120        0
2             39   140    82    56     317       280      +37
3             45    97   173    93     408       350      +58
4              8    40    86   146     280       250      +30

Total        129   317   376   303    1125      1000      125
Required     120   280   350   250    1000
Excess        +9   +37   +26   +53     125

The three stages of the calculation are shown in Table 3.4.b. In stage 1
the excesses of the rows have been distributed in proportion to the numbers
of units in the sub-strata of each row. The distributed excesses are added by
columns and compared with the required excesses. The differences, with
signs reversed, are distributed by columns in stage 2, excluding the first row,
and the process is repeated for rows in stage 3. The numbers are now small
and empirical adjustments, shown in brackets, have been chosen to make
the column totals zero.

TABLE 3.4.b CALCULATION OF NUMBERS OF UNITS TO BE REJECTED

                       Stage 1                            Stage 2
               A1    A2    A3    A4   Total       A1    A2    A3    A4   Total

B1              0     0     0     0      0         0     0     0     0      0
B2              5    16     9     7     37        -1    +2    -4    +3      0
B3              6    14    25    13     58        -2    +1    -9    +5     -5
B4              1     4     9    16     30         0     0    -4    +9     +5

Total          12    34    43    36    125        -3    +3   -17   +17      0
Required        9    37    26    53    125
Difference     +3    -3   +17   -17

                              Stage 3
               A1        A2        A3        A4      Total

B1              0         0         0         0         0
B2              0         0         0         0         0
B3          +1 (0)       +1        +2     +1 (+2)      +5
B4              0        -1     -1 (-2)   -3 (-2)      -5

Total       +1 (0)        0     +1 (0)    -2 (0)        0

TABLE 3.4.c NUMBERS OF UNITS REJECTED, NUMBERS IN FINAL SAMPLE AND CORRECT SUB-STRATA TOTALS

              Numbers of units rejected          Numbers in final sample
               A1    A2    A3    A4   Total       A1    A2    A3    A4   Total

B1              0     0     0     0      0        37    40    35     8    120
B2              4    18     5    10     37        35   122    77    46    280
B3              4    16    18    20     58        41    81   155    73    350
B4              1     3     3    23     30         7    37    83   123    250

Total           9    37    26    53    125       120   280   350   250   1000

              Correct sub-strata totals
               A1    A2    A3    A4   Total

B1             40    40    30    10    120
B2             40   120    80    40    280
B3             30    80   160    80    350
B4             10    40    80   120    250

Total         120   280   350   250   1000


The three stages are then summed and deducted from the numbers in the 
original sample, as shown in Table 3.4.c. This table also shows the correct
proportional sub-strata totals of the population from which the sample was 
drawn. The appropriate statistical test* shows that a satisfactory sample has 
been obtained. 

The above process does not necessarily converge, but will usually do so 
in practical cases. If a negative value for number of units rejected is obtained 
for any sub-stratum this can be brought to zero by an empirical adjustment. 
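
The summation of the three stages, and their deduction from the original sample, can be checked mechanically. In the sketch below the figures are those read from Tables 3.4.a and 3.4.b, the adjusted (bracketed) values being used for stage 3. The variable names are illustrative, and the code verifies the arithmetic only; it does not reproduce the hand rounding by which the stages were obtained.

```python
initial = [[37, 40, 35, 8], [39, 140, 82, 56], [45, 97, 173, 93], [8, 40, 86, 146]]
stage1 = [[0, 0, 0, 0], [5, 16, 9, 7], [6, 14, 25, 13], [1, 4, 9, 16]]
stage2 = [[0, 0, 0, 0], [-1, 2, -4, 3], [-2, 1, -9, 5], [0, 0, -4, 9]]
stage3 = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 2, 2], [0, -1, -2, -2]]   # bracketed values
required = [120, 280, 350, 250]          # the same for rows (B) and columns (A)

# Numbers to be rejected from each sub-stratum: the sum of the three stages.
rejected = [[a + b + c for a, b, c in zip(r1, r2, r3)]
            for r1, r2, r3 in zip(stage1, stage2, stage3)]

# Final sample: original numbers less the numbers rejected.
final = [[n - r for n, r in zip(row_n, row_r)]
         for row_n, row_r in zip(initial, rejected)]

row_totals = [sum(row) for row in final]
column_totals = [sum(col) for col in zip(*final)]
```

Both sets of marginal totals of the final sample agree with the required totals, and no sub-stratum has a negative number of rejections.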



3.5 Stratification with a variable sampling fraction 

In certain types of material very considerable gains in accuracy will result 
if different sampling fractions are used for the different strata. The greatest 
accuracy for a given number of units will be attained if the sampling fractions 
are proportional to the within-strata standard deviations† of the units. If the
sampling fractions are denoted by $f_1, f_2, \ldots$ and the standard deviations by
$\sigma_1, \sigma_2, \ldots$ we have

$$\frac{f_1}{\sigma_1} = \frac{f_2}{\sigma_2} = \cdots$$

In some cases this formula may give sampling fractions greater than unity for
some of the strata. If this occurs the whole of these strata are included in the
sample.‡
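
The rule of proportionality, with the provision for strata which have to be included whole, may be sketched as follows. The function name, the strata sizes and the standard deviations are hypothetical, the standard deviations are assumed to be greater than zero, and variations in cost are ignored.

```python
def variable_fractions(sizes, sds, n):
    """Sampling fractions proportional to the within-strata standard deviations.

    sizes: stratum -> number of units; sds: stratum -> standard deviation (> 0);
    n: total number of units wanted. Strata whose fraction would reach unity are
    taken whole, and the remaining units are re-allocated among the other strata.
    """
    fractions = {}
    remaining = dict(sizes)
    while remaining:
        k = n / sum(remaining[s] * sds[s] for s in remaining)   # f = k * sd
        over = [s for s in remaining if k * sds[s] >= 1.0]
        if not over:
            fractions.update({s: k * sds[s] for s in remaining})
            break
        for s in over:
            fractions[s] = 1.0
            n -= remaining.pop(s)        # the whole stratum is included
    return fractions

# Hypothetical material stratified into three size-groups.
sizes = {"small": 1000, "medium": 300, "large": 20}
sds = {"small": 1.0, "medium": 5.0, "large": 60.0}
fractions = variable_fractions(sizes, sds, 100)
```

In this illustration the first application of the formula gives a fraction greater than unity for the largest size-group, which is therefore taken whole, the remaining 80 units being divided between the other two groups.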

A particularly important application of the variable sampling fraction is 
to material stratified into size-groups. In such material the various quantitative 
characteristics of the units under investigation often have within-strata standard 
deviations which are roughly proportional to the mean sizes of the units in the 
different size-groups. In this case the sampling fractions should be taken about 
proportional to these mean sizes. If quantitative characteristics very highly 
correlated with size of unit are under investigation, the ranges of the size- 
groups may give good estimates of the relative within-strata standard deviations. 
The sampling fractions may then be taken proportional to these ranges. 
Changes with time, however, are usually by no means so highly correlated with 
size, and when the changes are of interest, sampling fractions proportional to 
the mean sizes of the size-groups will usually be best.

* χ² = 9.1, 9 degrees of freedom.

† The meaning of this term and the method of estimation will be explained in
Chapter 7.

‡ When variations in cost have to be taken into account the formula 8.17.H is
appropriate.

The above rules will determine sampling fractions which give the maximum
accuracy for estimates of the population values. In cases in which the values
for the individual strata are of interest, i.e. cases in which the strata themselves
form domains of study, it is also important to see that all the strata are adequately
represented in the sample, and for this reason the rule of strict proportionality
to the standard deviations, or to the mean sizes of the size-groups, often requires
some modification.*

When several quantities are under investigation, it will usually be found 
that their within-strata standard deviations are not in quite the same proportions.
This, however, is not a very serious problem in practice, since any sampling
fractions which are somewhere near the optimal will give results which are
nearly as accurate as those given by the optimal fractions. Consequently
there is usually no great difficulty in choosing suitable sampling fractions
which will reconcile the various conflicting requirements.†

Since, for these reasons, sampling fractions are often used which are not
optimal, we have preferred not to adopt the term "optimal allocation," which
has sometimes been used to denote stratified sampling with a variable sampling 
fraction. 

The within-strata standard deviations can only be estimated from data 
relating to the material to be sampled, or from data derived from similar 
material, but general knowledge of the behaviour of material of a particular 
type, e.g. material stratified into size-groups, will often enable suitable sampling 
fractions to be chosen with all necessary accuracy. It is sometimes suggested 
that a preliminary survey should be undertaken merely to determine the optimal 
sampling fractions, but this is rarely worth while, though if a preliminary survey 
is being undertaken for other purposes it will of course also serve to improve 
the sampling fractions. 

3.6 Systematic samples from lists 

Although the importance of the principle of random selection in sampling 
has been stressed, much practical sampling is in fact not fully random in 
character. Thus a frequent method of selecting a sample, when a list of the 
units of the population to be sampled is available, is to take every qth entry 
on this list. This may be termed a systematic sample from the list. Other 
more complicated systematic procedures may occasionally be adopted for 
special purposes. 

It is customary, and salutary, to determine the first entry by selecting a 
number at random between 1 and q, but this element of randomness does not
convert the sample into a random one.‡ A systematic sample would be
equivalent to a fully random sample if the list were arranged wholly at random. 
No lists, however, are arranged at random. The nearest approach to random 
order is probably provided by alphabetical lists, though even these have certain 
non-random characteristics : in this country, for instance, a large proportion 
of the Scotsmen will be found under the letter M. If every qth entry is taken, 
a kind of partial stratification will therefore be obtained, and the sample 
will be somewhat more precise than a fully random sample.

* The situation when domains of study cut across strata is discussed in Sections
9.3 and 9.4.

† An exact solution of this problem is given in Section 10.4.

‡ Except in the trivial sense that the sample is a random sample of 1 unit out of
q units, each unit being composed of the aggregate of a set of all entries at spacing q.

Thus in a systematic sample of farms taken from a list of farms arranged by parishes,
the proportion of farms drawn from each parish will be more or less constant,
provided the sampling interval is small compared with the number of farms
in a parish.*
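
The taking of every qth entry, with the first entry determined at random, may be sketched as follows. The function name and the list of 100 entries are illustrative.

```python
import random

def systematic_sample(entries, q, rng):
    """Every q-th entry of a list, the first being chosen at random among the first q."""
    start = rng.randint(1, q)            # a number at random between 1 and q inclusive
    return entries[start - 1::q]

entries = list(range(1, 101))            # a hypothetical list of 100 entries
sample = systematic_sample(entries, 20, random.Random(0))
```

Whatever the starting number, successive entries of the sample are exactly q places apart on the list, so that only q different samples are possible.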

Owing to the lack of definition of the strata it is impossible to make a fully 
valid estimate of the sampling error, but provided there are no periodic 
features in the list the sample will not be biased. An estimate of the sampling 
error which is good enough for practical purposes can usually be made by 
regarding the sample as a sample stratified in the major subdivisions of the list, 
ignoring any minor and ill-defined groupings. If the sampling error is estimated 
as if the sample were fully random an overestimate will be obtained, the 
inaccuracy being greater the more marked the similarity of the neighbouring 
entries in the list. 

In general, systematic sampling from lists will be found to be quite 
satisfactory provided care is taken to see that there are no periodic features 
in the list which are associated with the sampling interval. The method is 
often much more convenient than random or stratified random sampling, 
since the labour of making a proper random selection, which in an extensive 
sampling scheme is often very considerable, is avoided. It must be clearly 
recognized, however, that the responsibility for the judgment that the material 
is such that systematic sampling will give satisfactory results rests with the 
investigator. 

Sampling in which the selection is wholly systematic should be clearly 
distinguished from sampling in which there is proper random selection of 
sampling units which are themselves systematic aggregates of smaller subdivisions
of the material. Thus a common method of sampling rows of potatoes
has been to use sampling units consisting of every 20th plant, two such sampling
units being selected at random from each row by selecting two numbers at
random between 1 and 20. Such a method of sampling fulfils all the conditions
required for fully valid random sampling. 
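
A sketch of this method of sampling a row is given below. The row length of 200 plants and the function name are assumptions made for illustration. Each sampling unit is the aggregate of every 20th plant from a given starting plant, and two such units are selected at random from the 20 possible units.

```python
import random

def systematic_units(row_length, spacing, units_wanted, rng):
    """Select sampling units at random, each unit being every `spacing`-th plant of the row."""
    starts = rng.sample(range(1, spacing + 1), units_wanted)   # distinct random starts
    return [list(range(s, row_length + 1, spacing)) for s in sorted(starts)]

units = systematic_units(200, 20, 2, random.Random(3))   # plant numbers in the two units
```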

3.7 An example of alternative ways of sampling highly variable material

In order to illustrate the various methods of sampling which have been 
discussed in the preceding sections, we will consider their application to 
the problem of determining the area under wheat in an English county. For 
this purpose the wheat acreages of Hertfordshire farms for 1939 were utilized. 

The choice of problem and of basic material were dictated partly by 
convenience, and partly by the need for a complete set of data covering highly 
variable material, since an investigation of different sampling methods can be 
carried out most easily when data are available for the whole of the material. 

* Methods of taking a systematic stratified sample from a list or card index are
described in Section 10.2.

It should be emphasized that this example is to be regarded as illustrative
only. Sampling methods are very unlikely to be required for the determination
of crop acreages in a country such as England, since all farmers make returns
of their acreages each year ; the only possible use of sampling would be at 
the compilation stage, where it might be used to avoid the necessity of totalling 
the whole of the returns. The present investigation is not fully relevant to 
this problem, however, since the acreage of only one of the most extensively 
grown crops is considered, and that for only one county. Considerably greater 
errors, proportionately, may be expected in the less common crops. 

Records were available for 2496 farms, and the acreage of wheat, and also 
the total acreage of crops and grass, which is virtually the total acreage of 
farmed land and will be termed the size of the farm, were abstracted for each 
farm. The original records were arranged by districts, by parishes alphabetically 
within districts, and by farmer's name alphabetically within parishes. This 
order was preserved in the abstract. The return for any farm, or "holding,"
does not necessarily relate solely to land in a given parish, but may include 
land in other parishes farmed by the same farmer ; farmers with two or more 
distinct farms may make separate returns for these farms or may include them 
all in a single return. The total area of wheat in the county, from the abstracted 
returns, was 44,676 acres, and the total area of crops and grass was 273,074 
acres.* 

If farms are taken as sampling units the dominant source of variation in 
wheat acreage will be variation in size of farm, since farms range from 1 acre to 
over 1000 acres, and no farm can have more than a fraction of its area under 
wheat. Stratification by size of farm is therefore indicated. The use of a 
variable sampling fraction will also be advantageous, since the wheat acreages 
of the large farms will be much more variable than those of the smaller farms. 
Further stratification by districts is possible, but is not likely to give much 
increase in precision unless the incidence of wheat growing in the different 
districts is very markedly different. In any case a systematic method of 
selection from the list, which in view of the alphabetical method of arrangement 
will be quite satisfactory, will give the effect of stratification by districts. 

For comparative purposes the following samples were taken : 

(1) a random sample of 1 in 20 farms, 125 farms in all ; 

(2) a stratified random sample with a uniform sampling fraction of 1 in 20 ; 

(3) a stratified systematic sample with a variable sampling fraction, the 
fraction being approximately proportional to mean size of farm within 
each size-group, and chosen so as to give about the same number of 
farms, actually 135, as samples 1 and 2 ; the systematic method of 
selection within size-groups results in approximate stratification by 
districts also. 

* It may be noted that these values disagree with the values shown in Agricultural 
Statistics, viz. 46,281 acres and 278,380 acres respectively. The reasons for this 
discrepancy need not concern us here, but it provides an illustration of the fact 
that disagreement between sample and complete returns must not be assumed to 
be necessarily solely due to sampling error. 
The size-groups chosen, the number of farms in each group, and the
sampling fractions and numbers of farms for sample 3 are shown in Table 3.7.a.
For sample 2 the last two size-groups were combined. 

TABLE 3.7.a HERTFORDSHIRE FARMS, 1939: SIZE-GROUPS, NUMBERS OF FARMS, AND CHOSEN VALUES OF THE VARIABLE SAMPLING FRACTION

Size-group              No. of farms    Sampling    No. of farms
(acres crops            in county       fraction    in sample
and grass)

1-5                          435          Nil             0
6-20                         519          1/200           3
21-50                        357          1/60            6
51-150                       519          1/20           26
151-300                      400          1/10           40
301-500                      215          1/5            43
501-                          51          1/3            17

Total                      2,496                        135
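
The numbers of farms in sample 3 follow from the sampling fractions by taking the nearest whole number in each size-group, as the following sketch shows. The names are illustrative, the last size-group is labelled simply as the largest, and the rule of rounding halves upwards is an assumption which does not in fact arise with these figures.

```python
from fractions import Fraction
import math

# (size-group, farms in county, sampling fraction) as given in Table 3.7.a
groups = [
    ("1-5", 435, Fraction(0)),
    ("6-20", 519, Fraction(1, 200)),
    ("21-50", 357, Fraction(1, 60)),
    ("51-150", 519, Fraction(1, 20)),
    ("151-300", 400, Fraction(1, 10)),
    ("301-500", 215, Fraction(1, 5)),
    ("largest", 51, Fraction(1, 3)),
]

def nearest_whole(x):
    """Round to the nearest whole number, halves upwards."""
    return math.floor(x + Fraction(1, 2))

# Working fraction times farms in the group, rounded to a whole number of farms.
sample_sizes = {name: nearest_whole(farms * f) for name, farms, f in groups}

# Exact sampling fractions actually realised in each size-group.
exact_fractions = {name: Fraction(sample_sizes[name], farms) for name, farms, _ in groups}
```

The exact sampling fraction for the 6-20 acre group, for instance, is 3/519 or 1/173, as against the working fraction of 1/200.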



As pointed out in Section 3.3, the random sample can be stratified after
selection so as to eliminate the effect of variation between size-groups, provided 
the number in each size-group of the population is known. Variation due to 
size may also be eliminated by using size as supplementary information, provided 
information on size of farm is available for each farm in the sample, and the 
total area of all farms in the county is known. Either the ratio or the regression 
method may be used, as explained in Chapter 6. 

The estimates of the wheat acreage and of the number of farms growing 
wheat obtained from these samples, their estimated sampling standard errors, 
and in the case of the acreage the actual errors of the estimates, are shown 
in Table 3.7.b. The sampling standard errors, as will be explained in 
detail in Chapter 7, give measures of the average magnitudes of the sampling 
errors that may be expected with given methods of sampling and of estimation. 
The computations leading to these estimates are set out in Chapters 6 and 7, 
where tables giving the actual values of the sample units for the three samples 
(Tables 6.6.a, 6.5.a and 6.7.a respectively) will also be found.

It is apparent from the values of the sampling standard errors for wheat 
acreage that, as is to be expected, both stratification and the use of a variable 
sampling fraction have resulted in large gains in accuracy.* The use of 
supplementary information on size, whether by stratification after sampling,
by ratio or by regression, serves to make the random sample about as accurate
as the stratified sample with a fixed sampling fraction. This is also to be
expected.

* The word accuracy is used in this book to denote the expected accuracy of an
estimate, as indicated by its sampling standard error. It is occasionally also used to
denote the actual accuracy, as indicated by the actual error (usually unknown), but this
should not cause any confusion.

The numbers of units, i.e. farms, required to attain the same accuracy
with different methods of sampling and estimation may be taken as roughly
proportional to the squares of the standard errors, allowance being made
for the greater number of units included in sample 3. These are shown in
the column "relative variance," the random sample with direct estimation
being set at 100. Thus stratification by size, or elimination of variation due to
size in the estimation process, reduces the number of farms required by a
factor of about 4, and the variable sampling fraction results in a further
reduction by a factor of about 2½.

TABLE 3.7.b HERTFORDSHIRE FARMS: COMPARISON OF VARIOUS TYPES OF SAMPLES AND METHODS OF ESTIMATION

                                            Wheat acreage                   No. of farms growing wheat
     Method of selection                Standard   Actual   Relative               Standard   Relative
No.  and estimation           Estimate    error     error   variance   Estimate     error     variance
                                                            per farm                          per farm

1a   Random, direct            46,020    ±7,950    +1,340     100         900       ±104.6       100
1b   Random, stratified        41,100    ±4,320    -3,580      30         860        ±75.2        52
     after selection
1c   Random, by ratio          41,570    ±3,940    -3,110      25           Not calculated
1d   Random, by regression     40,400    ±4,130    -4,280      27           Not calculated
2    Stratified, direct        40,220    ±4,110    -4,460      27       1,080        ±71.6        47
3    Variable sampling         42,765    ±2,550    -1,915      10         911        ±88.9        72
     fraction, direct

The situation with respect to number of farms growing wheat is somewhat 
different. Stratification has again resulted in considerable increase in accuracy, 
though the gain is not so great as with acreage. The sample with a variable 
sampling fraction, on the other hand, is not so accurate as the ordinary stratified 
sample. The sampling fractions which are optimal for the determination of 
wheat acreage are by no means optimal for the determination of the number
of farms growing wheat. 

The actual errors of the estimates bear little relation to the sampling 
standard errors, except that they are in no case markedly larger than these 
standard errors. The random sample without adjustment gives the most 
accurate estimate of acreage, though this estimate has the largest sampling 
standard error, and adjustment of the random sample makes the actual error 
larger, though it reduces the sampling standard error. This is an illustration 
of the fact that an inaccurate method of sampling will sometimes by chance 
give an accurate estimate. The accuracy of a sampling procedure must never 
be judged by the magnitude of a single discrepancy ; a large discrepancy 
provides some evidence that a method is inaccurate, but a single small 
discrepancy provides practically no evidence that it is accurate. 

3.8 Multi-stage sampling 

In multi-stage sampling the material is regarded as made up of a number 
of first-stage sampling units, each of which is made up of a number of second- 
stage units, etc. The sampling process is carried out in stages. At the first 
stage the first-stage units are sampled by some suitable method, such as 
random or stratified sampling. At the second stage a sample of second-stage 
units is selected from each of the selected first-stage units, again by 
some suitable method, which may be the same as or different from the 
method employed for the first-stage units. Further stages may be added as 
required. 

By suitable choice of sampling fractions it is often possible to keep the over- 
all sampling fraction (i.e. the product of the sampling fractions at the different 
stages) constant for different parts of the population. This leads to considerable 
simplification of the computations (see Section 6.18). 

Multi-stage sampling introduces a flexibility into sampling which is lacking 
in the simpler methods. It enables existing natural divisions and subdivisions 
of the material to be utilized as units at the various stages, and, as pointed out 
in Section 2.9, it permits the concentration of the field work of censuses and 
surveys covering large areas. On the other hand, for the reasons there given, 
a multi-stage sample is in general less accurate than is a sample containing 
the same number of final-stage units which have been selected by some 
suitable single-stage process. 

Multi-stage sampling also has the important advantage that subdivision 
into second-stage units, i.e. the construction of the second-stage frame, need 
only be carried out for those first-stage units which are actually included in 
the sample. It is therefore particularly valuable in surveys of undeveloped 
areas where no frame exists which is sufficiently detailed and accurate for 
subdivision of the material into reasonably small sampling units. 

Since there are many variants of multi-stage sampling which are possible 
for any given type of material, careful investigation is often required before a 
decision as to the procedure which is best for any particular purpose can be 
reached. This matter will be discussed in detail in Chapter 8, after the methods 
of evaluating sampling errors have been described. 

3.9 Sampling with probabilities proportional to size of unit 

If we have areas demarcated on a map, such as fields, and a point is located 
at random on the map, the probabilities of the point falling within the boundaries 
of the different fields are clearly proportional to the areas of the fields. 
Consequently areas can be selected at random with probabilities proportional 
to their size by the simple procedure of taking random points on the map. 
It will be noted that such a process of selection may result in the same area 
being included twice or more in the sample. In this case it must be counted 
twice or more. We cannot, without distorting the probabilities, make a further 
selection in the manner followed with equal probabilities. 

The principle has applications in agricultural surveys designed to determine 
the acreage and yield of different agricultural crops, total cultivated area, etc. 
All that is required for acreage is to determine the proportion of points which 
fall in areas of the given type. The method is therefore particularly attractive 
when carrying out surveys of the areas of crops, etc., by aerial survey, provided 
the different crops can be recognized on the photographs, since it avoids all 
the measurements of area which would be required if an ordinary random 
sample of areas were taken. The sampling of the fields with probabilities 
proportional to size is in this case equivalent to the sampling of small unit 
areas of equal size whose locations are determined by the random points. 
When only areas require to be determined the sizes of the fields in which the 
random points fall are in fact immaterial. 
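
The point-sampling procedure can be sketched in Python as follows. This is a minimal illustration only: the rectangular "fields", their crops, the map dimensions and the function names are all hypothetical, and a real map would of course have irregular boundaries.

```python
import random

# Hypothetical map: a 10 x 10 region divided into rectangular fields.
# Each field is (x_min, y_min, x_max, y_max, crop).
FIELDS = [
    (0, 0, 6, 5, "wheat"),     # 30 per cent of the map
    (6, 0, 10, 5, "grass"),    # 20 per cent of the map
    (0, 5, 10, 10, "barley"),  # 50 per cent of the map
]
MAP_WIDTH, MAP_HEIGHT = 10.0, 10.0


def crop_at(x, y):
    """Return the crop of the first field containing the point (x, y)."""
    for x0, y0, x1, y1, crop in FIELDS:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return crop
    raise ValueError("point lies outside every field")


def estimate_proportions(n_points, rng):
    """Estimate the proportion of the map under each crop from random points."""
    counts = {}
    for _ in range(n_points):
        x = rng.uniform(0, MAP_WIDTH)
        y = rng.uniform(0, MAP_HEIGHT)
        crop = crop_at(x, y)
        counts[crop] = counts.get(crop, 0) + 1
    return {crop: count / n_points for crop, count in counts.items()}
```

Only the crop at each point enters the estimate; the size of the field in which the point falls is never used, in agreement with the remark above.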

The analogy with the case of a stratified sample with a variable sampling 
fraction indicates that under certain circumstances greater precision may be 
expected from areas selected with probabilities proportional to size than will 
be obtained if they are selected with equal probabilities. 

In the case of yield determinations, when the total acreage is known, the 
determinations of the yield from a sample of fields selected with probability 
proportional to size may always be expected to give a more accurate estimate 
of the mean yield per acre and total yield than will similar yield determinations 
on a random sample of fields irrespective of size. If the total acreage is not 
known then the situation is more complicated, but here again sampling with 
probabilities proportional to size is often advantageous. 
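
The reason can be indicated in a line, on the assumption (not stated explicitly in the text) that the estimate of yield per acre is the simple mean of the per-acre yields of the selected fields. The symbols are introduced here for illustration only: a_i is the acreage and y_i the total yield of field i, A is the sum of the a_i and Y the sum of the y_i. When field i is selected with probability a_i/A, the expected per-acre yield of a selected field is

```latex
E\!\left[\frac{y_i}{a_i}\right]
  = \sum_i \frac{a_i}{A}\cdot\frac{y_i}{a_i}
  = \frac{Y}{A}
```

so that the sample mean estimates the yield per acre Y/A without bias, and multiplication by the known acreage A gives the total yield. The unweighted mean of the per-acre yields of fields selected with equal probabilities does not in general have this expectation.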

Sampling with probabilities proportional to size of unit, or to some other 
known quantitative character of the units, may be carried out on other types of 
material by forming a cumulative or running total of the sizes of the units, and 
selecting numbers at random from the total of all the units in the manner of 
Example 3.2.C.* Stratification by size and the use of a variable sampling 
fraction will usually be preferable in such cases, however, on the grounds both 
of accuracy and convenience, except in the special circumstances to be described 
in the next section. 

* An alternative method is described in Section 10.8. 
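
The running-total procedure might be sketched in Python as follows. Example 3.2.C is not reproduced here, so the details are an assumption about the general manner of selection rather than a transcription of it; the function name `select_pps` is hypothetical, and the sizes are assumed to be whole numbers.

```python
import bisect
import itertools
import random


def select_pps(sizes, n, rng):
    """Select n units with probability proportional to size, with replacement.

    Returns the indices of the selected units. An index may appear more
    than once, in which case the unit is counted that many times.
    """
    cumulative = list(itertools.accumulate(sizes))  # running totals of size
    total = cumulative[-1]
    chosen = []
    for _ in range(n):
        r = rng.randint(1, total)  # random number from 1 to the grand total
        # The selected unit is the first whose running total reaches r.
        chosen.append(bisect.bisect_left(cumulative, r))
    return chosen
```

With sizes 10, 30 and 60 the running totals are 10, 40 and 100, so that the numbers 1-10 select the first unit, 11-40 the second and 41-100 the third.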
3.10 Sampling from within strata with probabilities proportional to 
size of unit 

Apart from area sampling, sampling with probabilities proportional to size 
of unit is mainly of use when the units are stratified according to some other 
characteristic, and the number of units to be selected from each stratum, or 
from some of them, is small. In this case an ordinary stratified sample will 
give either inaccurate or biased estimates when the ratio method of estimation, 
explained in Chapter 6, is used. The bias or inaccuracy is removed by selecting 
the units from within strata with probabilities proportional to size. This 
fact appears to have been first recognized by Hansen and Hurwitz (1943, A). 

This procedure is particularly useful in conjunction with two-stage sampling 
with large first-stage units of variable size. The first-stage units are selected 
from within strata with probabilities proportional to size, and the second-stage 
units are selected with probabilities inversely proportional to the sizes of the 
first-stage units. By this device the overall sampling fraction is kept constant, 
with consequent simplification of the computations. 
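
The arithmetic of this device can be sketched in Python. The sketch assumes, for simplicity, that a single first-stage unit is drawn from the stratum; the function name and the figures in the usage note are hypothetical.

```python
from fractions import Fraction


def second_stage_fractions(sizes, overall_fraction):
    """Second-stage sampling fractions giving a constant overall fraction.

    One first-stage unit is assumed to be drawn from the stratum with
    probability size / total, so the second-stage fraction must be
    overall_fraction * total / size, i.e. inversely proportional to size.
    Returns a list of (first-stage probability, second-stage fraction).
    """
    total = sum(sizes)
    result = []
    for size in sizes:
        p_first = Fraction(size, total)
        f_second = overall_fraction / p_first
        if f_second > 1:
            raise ValueError("unit too small for the requested overall fraction")
        result.append((p_first, f_second))
    return result
```

For first-stage units of sizes 200, 300 and 500 and an overall fraction of `Fraction(1, 20)`, the first-stage probabilities are 1/5, 3/10 and 1/2 and the second-stage fractions 1/4, 1/6 and 1/10; each product is 1/20. The number of second-stage units taken is then 50 whichever first-stage unit is drawn.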

As before, when more than one unit is required from a stratum, selection 
with probability strictly proportional to size can only be simply effected if a 
unit which happens to be selected twice is counted twice. Generally, however, 
when each stratum contains only a few large units, duplication of units is not 
desirable; instead a further unit is selected, with slight inexactitude in the 
probability of selection, and consequent slight, but usually negligible, bias 
in the results. 



3.11 An example of sampling by administrative areas 

Reverting to the illustrative example of Hertfordshire farms considered in 
Section 3.7, we may now investigate the effect of taking parishes as sampling 
units, or as first-stage units in two-stage sampling with farms as second-stage 
units. 

The use of parishes as sampling units may be expected to result in a sample 
which is somewhat less accurate than a sample containing the same number 
of farms distributed over all parishes. Nevertheless, if actual visits have to 
be made to the farms it may pay to use parishes as sampling units, or as first- 
stage units in two-stage sampling. In analogous situations in undeveloped 
countries, where definition of farm boundaries may present difficulties, the 
complete survey of small administrative or other areas may also be better than 
any attempt to sample individual farms. 

Inspection of the Hertfordshire data showed that the total farm area (crops 
and grass) included in the returns of farmers of a single parish was very 
variable, partly owing to variations in size of the parishes, and partly because 
some of the parishes were mainly urban in character. It is therefore best to 
use individual parishes as sampling units only if they contain a certain minimum 
acreage of crops and grass. The minimum chosen was 2000 acres, the remaining 
parishes being grouped roughly in the order in which they appeared in the 
alphabetical list, to form "combined" parishes containing over the minimum 
acreage. The effect of this combination is shown in Table 3.11.a. 

TABLE 3.11.a NUMBERS OF FARMS, PARISHES, AND "COMBINED" PARISHES 
IN THE DISTRICTS OF HERTFORDSHIRE 

District  District             No. of   No. of     No. of parishes
No.                            farms    parishes   after combination

1         Barnet                 230       17              7
2         Bishop's Stortford     316       23             16
3         East Herts.            564       31             20
4         Hitchin                553       36             25
5         St. Albans             218        9              7
6         Tring                  424       16             11
7         Watford                191        8              5

          Total                2,496      140             91



Districts were used as strata in this sampling, 1 in 5 "combined" parishes 
being taken per district, i.e. 17 parishes in all. 
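
The figure of 17 parishes can be reproduced from the last column of Table 3.11.a. The sketch below assumes that the number taken in each district is one-fifth of the number of "combined" parishes rounded to the nearest whole number; the text gives only the total, so the rounding rule is an assumption, and the function name is hypothetical.

```python
# Numbers of "combined" parishes per district, from Table 3.11.a.
COMBINED_PARISHES = {
    "Barnet": 7,
    "Bishop's Stortford": 16,
    "East Herts.": 20,
    "Hitchin": 25,
    "St. Albans": 7,
    "Tring": 11,
    "Watford": 5,
}


def parishes_per_district(counts, rate=5):
    """Allocate 1 in `rate` units per stratum, rounded to the nearest whole number."""
    return {district: round(n / rate) for district, n in counts.items()}


allocation = parishes_per_district(COMBINED_PARISHES)
total_selected = sum(allocation.values())  # 1 + 3 + 4 + 5 + 1 + 2 + 1 = 17
```

Rounding down gives the same allocation for these figures.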

Two samples were taken. In sample A the parishes were selected in the 
ordinary manner, with equal probability of selection for each parish. In 
sample B selection with probabilities proportional to size was employed. The 
parishes of sample B were also sub-sampled in two ways, samples B1 and B2. 
In sample B1 a uniform sampling fraction of J was taken for sampling at the 
second stage, with stratification by size, using the size-groups of Table 3.7.a 
with the last two size-groups combined. In sample B2 a variable sampling 
fraction was used with values $ for size-group 1-50 acres, J for 51-150 acres, 
-| for 151-300 acres, and 1 for over 300 acres. Sample B is given in detail 
in Example 6.17. 

The relative efficiency of the various methods is discussed in Section 8.9. 
The results are summarized in Table 3.11.b. This table is similar to 
Table 3.7.b, except that estimated average values of the standard errors are 
given, and not those calculated from the actual selected samples. These latter 
are not sufficiently accurate for comparison owing to the small number of 
parishes involved. 

It will be seen that a sample of 1 in 5 parishes provides results which are 
decidedly more accurate than a stratified random sample of 1 in 20 farms 
with a uniform sampling fraction, but somewhat less accurate than a similar 
sample with a variable sampling fraction. The stratified random sample of 
1 in 20 farms is 1.29 times as accurate as sample B1, allowing for differing 
numbers of farms. The similar sample with the variable sampling fraction is 
1.83 times as accurate as sample B2. Sample B is somewhat more accurate 
than sample A, particularly if the unbiased estimate given by the overall ratio 
is used for sample A. The difference is not marked, however, since the 
combination of parishes has created units which do not differ excessively in size. 
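
The actual errors of Table 3.11.b can be checked against one another, since each is the difference between an estimate and the same true wheat acreage. The sketch below takes the actual error as estimate minus true value, with the signs read as negative for both estimates from sample A and positive for samples B, B1 and B2; this reading of the table is an assumption, which the agreement of the implied true values supports.

```python
# (estimate, actual error) in acres for each row of Table 3.11.b.
ROWS = {
    "A, overall ratio": (41730, -2950),
    "A, district ratios": (41010, -3670),
    "B": (46660, 1980),
    "B1": (48930, 4250),
    "B2": (45600, 920),
}

# True value implied by each row: estimate minus actual error.
implied_true_values = {name: est - err for name, (est, err) in ROWS.items()}
```

Every row implies the same true acreage, 44,680 acres.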

TABLE 3.11.b HERTFORDSHIRE FARMS: SAMPLES FOR WHEAT ACREAGE WITH 
"COMBINED" PARISHES AS SAMPLING UNITS OR FIRST-STAGE UNITS IN 
TWO-STAGE SAMPLING 

Sample  No. of  Method of sampling                           Method of        Estimate  Expected  Actual   Relative
        stages  1st stage                 2nd stage          estimation                 standard  error    variance
                                                                                        error

A       1       Stratified by district    -                  Overall ratio     41,730    3,080    -2,950     100
                                                             District ratios   41,010    3,010    -3,670      95

B       1       Stratified by district,   -                  District ratios   46,660    2,870    +1,980      87
                probability proportional
                to size

B1      2       As sample B               Stratified random  District ratios   48,930    4,050    +4,250     259

B2      2       As sample B               Variable sampling  District ratios   45,600    3,460      +920     127
                                          fraction


3.12 Multi-phase sampling 

It is sometimes convenient and economical to collect certain items of 
information from the whole of the units of a sample and other items of 
information from some only of these units, these latter units being so chosen 
as to constitute a sub-sample of the units of the original sample. This may 
be termed two-phase sampling. Further phases may be added if required. 

Multi-phase sampling is of value in several ways. Its simplest application 
is to the case in which the number of units needed to give the required accuracy 
on different items is widely different, either owing to the fact that the variability 
of the associated variates is different, or because the accuracy required is 
different. If no use is made of the relations between the different variates, 
such multi-phase sampling is equivalent to taking samples of different sizes 
for the different items. 

First-phase information may also be used as supplementary information in 
order to improve the accuracy of second-phase information, by the same 
methods, ratio and regression, that are applicable where supplementary 
information on the whole population is available. Thus, in a crop estimation 
survey based on farms as sampling units, a relatively large sample of farms 
may be taken for the determination of the acreage of the crop, and the yields 
may be determined on a sub-sample only of these farms. 
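
A minimal sketch of the crop estimation example, in Python. It assumes simple random sampling of farms at both phases, a known total number of farms in the population, and estimation of yield per acre as a ratio of totals over the sub-sample; these assumptions, the function name and the argument names are introduced for illustration and are not taken from the text.

```python
def two_phase_total_yield(n_population_farms, first_phase_acreages,
                          second_phase_acreages, second_phase_yields):
    """Estimate total crop production from a two-phase sample of farms.

    first_phase_acreages : crop acreage of every farm in the first-phase sample
    second_phase_acreages, second_phase_yields : acreage and total yield of
        the farms in the sub-sample on which yield was determined
    """
    # Phase 1: mean acreage per farm, raised to the whole population.
    mean_acreage = sum(first_phase_acreages) / len(first_phase_acreages)
    total_acreage = n_population_farms * mean_acreage
    # Phase 2: yield per acre, as a ratio of totals over the sub-sample.
    yield_per_acre = sum(second_phase_yields) / sum(second_phase_acreages)
    return total_acreage * yield_per_acre
```

With 100 farms in the population, first-phase acreages of 10, 20, 30 and 40, and a sub-sample of two farms of 10 and 30 acres yielding 20 and 60 units, the estimated acreage is 2,500 and the estimated yield per acre 2, giving a total of 5,000 units.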

If the first-phase information is collected prior to the second-phase 
information the first-phase information may be used as a basis for the sub- 
sampling process, e.g. by stratification of the first-phase units for the selection 
of the second-phase sample, with or without the use of a variable sampling 
fraction at the second phase. 

It will be noted that in both these latter applications of two-phase sampling 
the methods followed are the same as those adopted in ordinary single-phase 
sampling, the population being replaced by the first-phase sample; but since 
the first-phase information is not known for the whole population it is itself 
subject to sampling error, and this must be taken into account when estimating 
the sampling errors of the estimates of the second-phase variates. 

Multi-phase sampling differs structurally from multi-stage sampling in 
that in the former the same sampling units are used throughout, whereas in 
the latter a hierarchy of sampling units is used. Multi-phase sampling may be 
combined with multi-stage sampling. In a scheme for the estimation of the 
acreages and yields of agricultural crops, for example, a two-stage sample of 
farms and parishes may be taken for the estimation of acreages, and a sub- 
sample of these farms may be taken for the estimation of yields. 

3.13 Balanced samples 

If the average value of some quantitative character of the units, such as 
size, is known for the whole population, it is possible, provided the sizes of 
the individual sampling units are known, to select a sample in such a manner 
that the average size of the selected units is equal to the average size of all 
the units of the population. Such a sample will only be satisfactory if it is 
otherwise equivalent to a random sample, in which case it may be termed 
a balanced sample. 

Balance may be employed in conjunction with stratification for some other 
character. In this case balance may be effected either for the whole population, 
or for each of the strata separately. The latter course should only be adopted 
if the number of units selected from each stratum is moderately large: otherwise 
undue restrictions will be placed on the sample which will result in the selection 
of a sample which is not otherwise equivalent to a random sample. On the 
other hand, when the strata are balanced separately more accurate estimates 
of the separate strata means and totals will be obtained, and the accuracy 
of the estimates of the overall population means and totals may also be 
somewhat improved. 

Balancing for a known quantitative character provides an alternative to 
stratification by size-groups in this character. Balancing, however, will only 
be effective if the differences in the quantity or quantities under investigation 
are approximately proportional to differences in the known character, whereas 
stratification by size-groups will take account of any type of relationship. As 
will be seen in Chapter 7, the estimation of sampling errors is simpler in the 
case of a stratified sample. 

The increased accuracy resulting from balancing can equally well be 
obtained, at the expense of some additional computational labour, by adjusting 
the results of an unbalanced sample by the use of regression in the manner 
explained in Chapter 6. Since the additional labour of adjustment is nearly 
proportional to the number of variates under investigation, the advantages of 
balancing as opposed to regression increase as the number of variates increases. 

Balancing can also be carried out for a character which is inherently 
qualitative, but which for the sampling units actually employed acts as a 
quantitative variate because the sampling units are themselves aggregates of 
smaller natural units. Thus if a sample of a human population composed of 
two different races is being taken, the sampling units being administrative 
areas containing numbers of individuals, the sample can be balanced for the 
percentages of individuals in the two races. If the individuals were the sampling 
units then balance would be equivalent to stratification by races. 

A balanced sample is best selected by using a process of replacement. 
In the first place a random or stratified random sample of the required size is 
selected, record being kept of the order of selection. The average value of 
the known quantitative character is then calculated for the sample. This will, 
in general, not be equal to the average for the population, indicating lack of 
balance. A further sampling unit is then drawn, and compared with the first 
unit of the original sample. If balance is improved by substitution of the 
new unit, this is done, otherwise the original unit is retained. The process 
is then repeated for the second and following units of the original sample 
until an adequate degree of balance is attained. 
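
The replacement process can be sketched in Python as follows. Several details are assumptions made for illustration: the initial sample is a simple random sample, each further unit is drawn at random from the units not at present in the sample, "adequate balance" is expressed as a tolerance on the difference of means, and the process returns to the first unit after the last has been considered. The function name is hypothetical, and n must be smaller than the number of units in the population.

```python
import random


def balanced_sample(sizes, n, tolerance, rng, max_draws=10_000):
    """Select n unit indices whose mean size is close to the population mean.

    sizes     : known quantitative character of every unit in the population
    tolerance : acceptable absolute difference between sample and population means
    """
    population_mean = sum(sizes) / len(sizes)
    # Initial random sample, kept in order of selection.
    sample = rng.sample(range(len(sizes)), n)

    def imbalance(indices):
        return abs(sum(sizes[i] for i in indices) / n - population_mean)

    position = 0
    for _ in range(max_draws):
        if imbalance(sample) <= tolerance:
            break
        # Draw a further unit not already in the sample.
        candidate = rng.choice([i for i in range(len(sizes)) if i not in sample])
        trial = sample.copy()
        trial[position] = candidate
        if imbalance(trial) < imbalance(sample):
            sample = trial  # substitution improves balance, so it is made
        position = (position + 1) % n  # pass to the next unit of the sample
    return sample
```

A call such as `balanced_sample(list(range(1, 101)), 10, 1.0, random.Random(3))` returns ten distinct units whose mean size lies within 1 of the population mean of 50.5.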

The selection of a sample balanced for more than one character presents 
more difficult problems, and will not be discussed here. 

Balance, in the cruder form known as purposive selection, was at one time 
extensively used in sample censuses and surveys. No rigorous rules of 
selection were followed, however, with the result that many purposively selected 
samples were by no means equivalent to balanced random samples. Thus it 
frequently happened that the selection was confined to sampling units having 
values of the known quantitative character near the average. Clearly in such 
samples the variability of the known quantitative character, and of any other 
characters closely correlated with it, will be considerably less than the true 
variability in the population. The sample may also be unrepresentative in 
other ways. 

Purposive selection was often used in an attempt to avoid the necessity, 
which was otherwise apparent, of employing reasonably small sampling units. 
Thus Gini and Galvani (1929, A) selected a sample from the Italian Census Data 
of 1921 which consisted of all the returns of 29 out of the 214 circondari into 
which the country is divided, using seven control characters. Agreement of 
the average values of other characters in the sample and population was poor, 
and that of the frequency-distributions of such characters was even worse. 
The real weakness here is the use of excessively large units, though even with 
smaller units the use of purposive selection without rigorous rules of selection 
is always liable to give unsatisfactory results. There is, moreover, no means 
of judging its reliability. 

For these reasons purposive selection has ceased to be extensively used, 
and in modern sampling work it has largely been replaced by more thorough 
application of the principles of stratification, etc. Provided proper attention 
is paid to the process of selection, however, there is no fundamental objection 
to balanced samples. These have a certain limited usefulness in some types 
of census and survey work, though it must be recognized that the need for the 
subdivision of the population into an adequate number of sampling units is 
in no way obviated by balancing for one or more quantitative characters. 

3.14 Systematic samples from areas 

A common method of sampling material continuously distributed either 
in space or time is to take sampling units distributed at equal intervals over 
the material. The chief application in census and survey work is in the 
sampling of land areas. When maps are available the sampling units can be 
located by superimposing a grid of points, frequently of square, or nearly 
square, pattern. Such a sample may be termed a systematic area sample. 
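
The construction of a square grid of sampling points may be sketched in Python as follows. The map is taken to be a plain rectangle and the grid is located at random by choosing its origin within one grid cell; the function name and these simplifications are assumptions for illustration.

```python
import random


def systematic_grid(width, height, spacing, rng):
    """Return the points of a square grid laid over a width x height map.

    The grid is located at random by choosing its origin uniformly within
    one grid cell; all other points follow at equal intervals.
    """
    x_start = rng.uniform(0, spacing)
    y_start = rng.uniform(0, spacing)
    points = []
    y = y_start
    while y < height:
        x = x_start
        while x < width:
            points.append((x, y))
            x += spacing
        y += spacing
    return points
```

On a map 10 units square with a spacing of 2 units the grid contains 25 points, whatever the random origin.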

A systematic area sample differs from a systematic sample from lists mainly 
in the spatial distribution of the sampling units over the material. Most lists 
do not correspond at all exactly, except for major groupings, to any physical 
distribution, and a systematic sample from a list therefore usually approximates 
much more closely to a random sample than does a systematic area sample. 
Different methods of estimating the sampling error are therefore appropriate 
in the two cases. 

In general, provided there are no periodic features, a systematic area sample 
will be rather more accurate than a stratified random sample (with one unit 
per stratum) from strata consisting of rectangular blocks (or cells) whose centres 
are situated at the systematic sampling points. In material in which the 
variation is of a continuous nature it is impossible to make any accurate estimate 
of the sampling error without taking supplementary sampling points, though 
if there are no periodic features an upper limit can be obtained. 

If the regions near the boundary are likely to differ from the remainder 
of the area, as may be the case if the boundary is a natural one, such as a sea 
coast or a mountain range, it will be best, after locating the sampling grid at 
random, to demarcate the bounding lines of the cells, and sample at random 
the area which is not covered by complete cells, dividing it into equal or 
approximately equal areas and locating one sampling point at random within 
each of these areas. It will be convenient, if possible, to make these cell areas 
equal in area to those of the sampling grid, since equal weight will then be 
given to all sampling points. The same method of dealing with boundaries 
can be used if the sampling is random within rectangular cells. 

Systematic sampling is entirely unsuited to material which has periodic 
features, but apart from this will generally provide a satisfactory method of 
area sampling. It has the advantage over stratified random sampling from 
blocks that the location of the sampling units is simpler and the results obtained 
provide rather better material for the construction of maps, etc. As in 
systematic sampling from lists, however, the responsibility for the judgment 
that the material is such that systematic sampling will give satisfactory results 
rests with the investigator. 

3.15 Line sampling 

In the sampling of areas certain types of information can be ascertained 
almost as easily for all the points on a line as they can for a set of isolated 
points or areas. In such cases sets of parallel lines or strips may be taken 
as the sampling units. In stratified random line sampling, the area is divided 
into rectangular blocks of convenient length and of such a width that two 
selected sampling lines are included in each block, their location within the 
block being random. If an exact estimate of the sampling error is not required, 
only one line need be included in each block, with correspondingly smaller 
blocks. In systematic line sampling the sample is made up of lines at equal 
intervals. 

Line sampling provides an alternative method to point sampling (Section 3.9) 
for determining the proportions of a given area which are of different types. 
These proportions are given by the ratios of the aggregate of the intercepts 
of the different types. In area determinations of this type, whether by line or 
point sampling, systematic sampling can usually be adopted. The method 
is useful both for area determinations on the ground in undeveloped country 
and for the determination of areas from maps and aerial photographs. The 
method has been much used for forest surveys, where it is known as "cruising." 
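
The calculation of proportions from intercepts can be sketched in Python. The record format, a list of (type, length) pairs with one pair for every stretch of line crossing land of a given type, and the function name are hypothetical.

```python
def proportions_from_intercepts(intercepts):
    """Estimate the proportion of an area under each type from line intercepts.

    intercepts : list of (type, length) pairs, one for every stretch of a
                 sampling line that crosses land of the given type
    """
    totals = {}
    for land_type, length in intercepts:
        totals[land_type] = totals.get(land_type, 0.0) + length
    grand_total = sum(totals.values())
    # Each proportion is the aggregate intercept of the type divided by
    # the total length of line surveyed.
    return {land_type: t / grand_total for land_type, t in totals.items()}
```

For intercepts of 300 and 200 units of forest and 500 units of arable land, the estimated proportions are one-half each.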

If, instead of obtaining information for all points on each of the lines, a 
sample of such points is taken, the sampling becomes two-stage. If the lines 
and the points on them are both evenly spaced the sampling is equivalent to 
systematic point sampling. 

Line sampling of a somewhat different type is also used in order to determine 
the acreage of agricultural crops in areas which are well provided with roads. 
A route is chosen which covers the whole area as adequately as possible, and 
the lengths bordered by the different crops are measured. A car fitted with 
a special milometer can be used for this purpose. Estimates of the yield near 
harvest time can also be obtained in a similar manner, by stopping the car 
at every xth mile of a given crop, and cutting and harvesting a small area of 
the crop, the area being selected in some systematic manner, such as entering 
the crop a given number of paces at right angles to the road. 

Line sampling of this type does not provide an unbiased sample, since 
roads are by no means randomly located with regard to agricultural crops. 
The results from surveys following the same route in successive years may well 
be comparable, however, and with calibration by more exact methods from 
time to time, road cruising may provide a satisfactory method of making rapid 
and inexpensive surveys. Similar methods based on tracks are possible in 
areas with only a sparse road network. 

3.16 The principle of the moving observer 

If counts are required of a collection of individuals who are moving about, 
the ordinary methods of sampling can only be applied with difficulty. Thus, 
to determine the number of people in a crowded street by ordinary methods 
would require the demarcation of a number of small areas in the street and 
the counting of the number of people on each of these areas. The counts 
need not necessarily be simultaneous, but for any one area the number of 
people present at a given moment has to be counted. Unless photographic 
methods are available, or the areas are very small, such counts are extremely 
difficult, since individuals are continually moving into and out of the areas 
and are also moving about within them. 

Equally it is no use stationing observers at fixed points with instructions 
to count passers-by. The number of people in a street will depend not only 
on the numbers passing fixed points but also on the velocity of movement up 
and down the street. If all exits and entrances to the street are covered, and 
there are no people in the street at the start of the counts, the number present 
at any subsequent time can be determined from counts that are continuous 
and without error. In practice, however, errors in counting usually result in 
cumulative errors which invalidate the results. Thus it was found impracticable 
to determine the numbers in a department store by posting people at the doors 
to make counts. 

These difficulties can be overcome by using moving instead of stationary 
observers. To obtain an estimate of the number of people in a street, the 
observer traverses the street in one direction, counting all the people he passes, 
in whichever direction they are moving, and deducting all the people who 
overtake him. He then re-traverses the street in the opposite direction, moving 
at the same speed and counting as before. If this is done the average of the 
two counts gives an estimate of the average number of people in the street 
during the time of the counts. If people are mostly moving in one direction 
the count in this direction will be reduced, but the count in the opposite direction 
will be correspondingly increased. In practice the deductions required for 
those overtaking the observer can be kept small by moving at a speed greater 
than that of the majority of the crowd. 
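
The calculation is simple enough to be set out in a few lines of Python. The function name and the form of the records, one (passed, overtaken by) pair for each traverse, are hypothetical.

```python
def moving_observer_estimate(traverses):
    """Average number of people present, from counts made on paired traverses.

    traverses : list of (passed, overtaken_by) pairs, one per traverse;
                the text describes one traverse in each direction at the
                same speed.
    """
    # Net count for each traverse: people passed less people overtaking.
    net_counts = [passed - overtaken_by for passed, overtaken_by in traverses]
    return sum(net_counts) / len(net_counts)
```

Thus counts of 120 passed and 5 overtaking in one direction, and 80 passed and 3 overtaking in the other, give net counts of 115 and 77 and an estimate of 96 people. That the average is correct can be seen in an idealised case: if the crowd has density d per unit length over a street of length L and moves in one direction at speed v, an observer moving at speed u records a net count of dL(u - v)/u when travelling with the crowd and dL(u + v)/u when travelling against it, and the average of the two is dL, the number present.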

This method was used to estimate the numbers of people in streets, shops, 
etc., at different times of the day, in order that the adequacy of the provisions 
for public air raid shelters might be tested. It was found that very dense crowds 
in streets and shops could be estimated with surprising ease. Crowded streets 
were dealt with by teams of two or more observers, moving down the street 
in a transverse line, with each observer counting the people between him and 
the next observer. In the large stores the floor was divided into areas which 
were assigned to the different observers. 

The method is of general application. It can be used, for example, to assess 
populations of insects or animals in a state of movement, provided all individuals 
can be readily seen, and provided the passage of the observer does not itself 
influence the movement of the individuals. 

3.17 Interpenetrating samples 

It is often advantageous to take two or more independent samples of a 
given population, using the same sampling procedure for each sample. Such 
samples are called interpenetrating samples. 

Interpenetrating samples are of value if the survey or census has to be 
carried out by successive stages. This is frequently necessary when preliminary 
results are required quickly. Thus in the 1942 Census of Woodlands of 
England and Wales, described in Section 4.25, it was necessary to obtain a 
preliminary estimate of the timber content with very limited field staff within 
six months of the initiation of the survey. The work was therefore planned in 
two interpenetrating samples. Before the field work was commenced, each 
unit of the first sample was subdivided into two, since it became apparent that 
the whole of the first sample could not be completed in the allotted time. 
This further subdivision itself created two interpenetrating samples of a special 
type. By means of this procedure it was possible to provide a preliminary 
estimate of the total timber content of the whole country by the time it was 
required. 

An incidental advantage of interpenetrating samples is that separate and 
independent estimates of the characteristics of the population are furnished. 
The agreement of such estimates is often more convincing to the layman than 
any statement of the sampling error. 

Interpenetrating samples have a further use in that the different samples 
can be assigned to different investigators. Comparison of the results provides 
a check of the investigators against one another. 

To perform the functions outlined above, each of the interpenetrating 
samples must itself provide an adequate sample of the material and must be 
comparable with the other samples; in other words, the samples must be 
really interpenetrating. If this is not the case the comparisons between the 
different samples will be subject to relatively large errors. If, for instance, 
they are used to test differences between different investigators, the information 
obtained will be of insufficient accuracy to be of any real use. Equally, if one 
sample is used to provide a preliminary estimate, this estimate may well not 
attain the required degree of precision. Thus in an agricultural survey stratified 
by areas such as counties, the division of the counties into two groups, with 
each of the samples confined to one group only, would not be likely to give 
a satisfactory pair of interpenetrating samples. The separate samples would
be subject to variation between counties, and would therefore be considerably 
less accurate than the combination of the two, from which variation between 
counties is entirely eliminated. The proper use of interpenetrating samples 
therefore necessitates increased expenditure on travelling. 
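
The county example can be illustrated by a small simulation. The following is a minimal sketch only, under assumptions that are not taken from the text: ten counties of equal size, a hypothetical farm-level variate, and function names chosen purely for illustration. It compares the disagreement between the two sample estimates when both samples cover every county with the disagreement when each sample is confined to half the counties.

```python
import random
import statistics


def make_counties(rng, n_counties=10, farms=200):
    """Hypothetical population: counties differ widely, farms within a county less so."""
    counties = []
    for _ in range(n_counties):
        level = rng.uniform(50, 150)  # between-county variation (assumed)
        counties.append([rng.gauss(level, 10) for _ in range(farms)])
    return counties


def interpenetrating_pair(rng, counties, k=5):
    """Both samples cover every county: k farms per county in each sample."""
    means1, means2 = [], []
    for farms in counties:
        drawn = rng.sample(farms, 2 * k)  # random order, so the split is random
        means1.append(statistics.mean(drawn[:k]))
        means2.append(statistics.mean(drawn[k:]))
    return statistics.mean(means1), statistics.mean(means2)


def split_county_pair(rng, counties, k=5):
    """Each sample is confined to one half of the counties (2k farms per county)."""
    order = list(range(len(counties)))
    rng.shuffle(order)
    half = len(order) // 2

    def estimate(indices):
        return statistics.mean(
            [statistics.mean(rng.sample(counties[i], 2 * k)) for i in indices]
        )

    return estimate(order[:half]), estimate(order[half:])


def spread_of_difference(pair_fn, counties, trials=500, seed=2):
    """Standard deviation of (estimate 1 - estimate 2) over repeated sampling."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        a, b = pair_fn(rng, counties)
        diffs.append(a - b)
    return statistics.pstdev(diffs)


counties = make_counties(random.Random(1))
print("both samples in every county:", spread_of_difference(interpenetrating_pair, counties))
print("samples confined to half the counties:", spread_of_difference(split_county_pair, counties))
```

Under these assumptions the second figure is much the larger, since the difference between the two samples then includes the variation between counties.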

3.18 Sampling on successive occasions 

The types of sampling so far discussed are appropriate to a census or 
survey carried out on a single occasion, with the object of determining the 
characteristics of the surveyed population at or about a given point in time. 
If the population is subject to change, a survey carried out on a single occasion, 
however accurate, cannot of itself give any information on the nature or rate 
of such change. In certain types of population extraneous sources of 
information, such as registrations of births and deaths, may be relied on to 
provide information on the changes which the population is undergoing. Even 
in such cases the census must be repeated at intervals, both because of 
inaccuracies in the extraneous information, which may lead to a gradual 
accumulation of errors, and also because the information is rarely of such a 
nature that all aspects of the original census or survey can be kept up-to-date. 
Registration of births and deaths, for example, coupled with figures for 
immigration and emigration, will furnish data for the revision of the total of 
the population but will not enable changes in the population of separate towns 
and districts to be determined. 

In many cases no such extraneous information on the changes that are taking 
place is available, and in such cases provision must be made for periodical 
re-survey if up-to-date information is required. A number of alternatives 
then present themselves:

(1) A complete census or survey may be repeated in its original form at 
intervals. 

(2) A sample census or survey may be repeated at intervals, a new sample 
being selected on each occasion without regard to previous samples. 

(3) A sample census or survey may be repeated on the same sample. 

(4) Part of the sample may be replaced on each occasion, the remainder 
being retained. If there are a number of occasions a definite scheme 
of replacement may be followed, e.g. one-third of the sample may be 
replaced, each selected unit being retained (except for the first two 
occasions) for three occasions. 

(5) A re-survey of a sub-sample of the original sample may be made. In 
the case of a complete census this is equivalent to a re-survey of a sample 
of the whole population. 

The following terms are suggested for the last four alternatives : 
(2) independent samples ; (3) fixed sample ; (4) partial replacement ; 
(5) sub-sample. It will be noted that independent samples are formally 
equivalent to interpenetrating samples, a fixed sample is formally equivalent
to the observation of different characters (variates) on the same sample, and a 
sub-sample is formally equivalent to a two-phase sample. Only partial 
replacement has no formal equivalent in the types of sampling already described. 
These equivalences are of importance in that the methods of estimation will 
be the same for formally equivalent sampling processes. 
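
The replacement scheme quoted under alternative (4) can be written out explicitly. The sketch below is illustrative only: it assumes the sample is divided into three equal rotation groups, numbers the groups in order of entry, and uses hypothetical function names. One group is dropped and one new group introduced on each occasion.

```python
def rotation_schedule(n_occasions, groups_per_occasion=3):
    """Rotation groups surveyed on each occasion (occasion 1 first).

    On occasion t the sample consists of groups t, t+1, ..., so that one
    group is replaced each time and two-thirds of the sample is retained.
    """
    return [
        list(range(t, t + groups_per_occasion))
        for t in range(1, n_occasions + 1)
    ]


def occasions_retained(schedule):
    """Number of occasions on which each rotation group is surveyed."""
    counts = {}
    for groups in schedule:
        for g in groups:
            counts[g] = counts.get(g, 0) + 1
    return counts


schedule = rotation_schedule(5)
for occasion, groups in enumerate(schedule, start=1):
    print("occasion", occasion, "groups", groups)
print(occasions_retained(schedule))
```

Groups 1 and 2, which are dropped after the first and second occasions, are retained for one and two occasions only; every later group is retained for three occasions, apart from those still in the sample when the schedule is cut off.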

The relative advantages of the various types of procedure depend on the 
relation between the variability of the units and the variability of changes in 
these units as well as on the relative importance of information on the 
population means and on the changes in these means. If, for instance, the units 
are very variable but the changes of all units are similar, accurate information 
on change can most easily be obtained by re-survey of a fixed sample of units ; 
provided always that proper provision is made for new entrants to the population, 
and for the elimination of the disturbance which results from the extinction 
of selected units. If, on the other hand, information on the population means 
is of paramount importance, partial replacement or a sub-sample will usually
be preferable. A more detailed discussion, in terms of the errors to which the 
various estimates are subject, is given in Section 8.8. 
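
The advantage of a fixed sample for the estimation of change can be seen from the standard result for the variance of a difference of two means. The sketch below is not the treatment of Section 8.8; it assumes simple random samples of n units on each of two occasions, the same standard deviation sigma per unit on both occasions, a correlation rho between the values of the same unit on the two occasions, and no finite population correction. All names are illustrative.

```python
import math


def se_change_independent(sigma, n):
    """Standard error of the estimated change with a new sample on each occasion."""
    return math.sqrt(2 * sigma ** 2 / n)


def se_change_fixed(sigma, n, rho):
    """Standard error of the estimated change when the same units are re-surveyed."""
    return math.sqrt(2 * sigma ** 2 * (1 - rho) / n)


sigma, n = 10.0, 100
for rho in (0.0, 0.5, 0.9):
    print(
        "rho =", rho,
        "independent:", round(se_change_independent(sigma, n), 3),
        "fixed:", round(se_change_fixed(sigma, n, rho), 3),
    )
```

When the units are very variable but change in a similar manner, rho is close to 1 and the fixed sample gives a much smaller standard error of the change; when rho is zero the two procedures are equivalent in this respect.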

There are two further points which must be borne in mind in connection 
with sampling on successive occasions. Firstly, repeated re-survey of the
same units may be inexpedient, since resistance to the provision of the necessary 
information may be engendered, and secondly, repeated re-survey may result 
in modification of these units relative to the rest of the population. This can 
arise in many ways. In a survey of agricultural practice, for instance, visits 
to farms may result in the farmers concerned improving their practice through 
advice from the investigators : advice which, if asked for, can scarcely be refused. 
A more subtle example is provided by the 1942 Census of Woodlands. In this 
census it was considered that if the subsequent fellings and replantings were 
recorded on the sample areas then surveyed an adequate measure of the changes 
in woodland throughout the country might be obtained over some considerable 
period of time. It has since been suggested that the amount of felling on 
the sample areas may have been affected by the fact that survey information 
was available on these areas and not on others, with the result that these areas 
have been more intensively exploited. 

3.19 Composite sampling schemes 

Simplicity and uniformity of sampling procedure is obviously in general 
desirable, but there are occasions on which different methods of sampling are 
required for different parts of the population. In sampling a human population, 
for instance, some form of area sampling may be most suitable for the rural 
parts of the country, whereas some form of stratified random or systematic 
sampling based on lists of houses may be best in the towns. There is, of 
course, no objection to the use of such composite schemes, provided each part 
fulfils the requirements of good sampling procedure already laid down. 

3.20 Combination of complete census and sample survey

It sometimes happens that a complete enumeration of the population can 
easily be made, but that detailed information on the individual units of the 
population can only be obtained by sampling methods. In such cases a complete 
enumeration will often be of value as a frame for the sample census. Thus 
in a census of a human population, a complete enumeration consisting of lists 
of households and of numbers in each household can be made. A sample 
of houses can then be visited by investigators so as to obtain details regarding 
the age, sex, etc., of the occupants. Such a sample census will not only serve 
to provide the required detailed information, but will also provide a partial 
check on the accuracy of the complete enumeration. It will not, however, 
provide a check on omissions from the lists of households. To carry out such 
a check it will be necessary to take a further sample of properly defined areas, 
checking that all the households in the sample areas have been included in 
the full census returns. 

A complete census, even if it is very inaccurate, is also of the greatest 
use in planning a more accurate sample census. In a sample census of a 
human population, for instance, some knowledge of the relative sizes of different 
towns and villages, and of the density of population in rural areas, is essential 
for the proper allocation of resources. Similarly in a census of agriculture, 
knowledge of the amount of cultivated land in different parts of the country 
is necessary if excessive survey of largely uncultivated areas is to be avoided. 

The information provided by an inaccurate complete census can also be 
used to improve the accuracy of a subsequent sample census, by the methods 
applicable to supplementary information which will be given in Chapter 6. 
Here, however, we must proceed with caution. If, for instance, a complete 
census of a human population consistently underestimates the population of 
villages of all sizes by about 10 per cent, the sample census will determine
the amount of the underestimation and a common adjustment can be made.
If, however, small villages are underestimated by 20 per cent and large villages
by 5 per cent, the application of a common correction will result in the
underestimation of the population of small villages and the overestimation of the
population of large villages. This distortion will be avoided if separate 
corrections are calculated for small and large villages. Unfortunately it is not 
always possible to be certain that all potential disturbances are taken into 
account. These differential inaccuracies are particularly troublesome in that 
they tend to be associated with the administrative areas for which separate 
results are required. The overall results, however, will not be materially 
affected by differential inaccuracies of this kind if the methods of estimation 
given in Chapter 6 are followed. 
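
The effect of differential inaccuracy may be shown by a short calculation. The figures below are hypothetical (only the 20 and 5 per cent underestimates come from the text), and the sample census is assumed, for simplicity, to determine the correction factors without sampling error.

```python
# Hypothetical true populations of the two classes of village.
true_pop = {"small": 100_000, "large": 300_000}
# Proportional underestimation in the inaccurate complete census (from the text).
undercount = {"small": 0.20, "large": 0.05}

census = {k: true_pop[k] * (1 - undercount[k]) for k in true_pop}

# A single common correction, as would be determined by the sample census overall.
common_factor = sum(true_pop.values()) / sum(census.values())
common_adjusted = {k: census[k] * common_factor for k in census}

# Separate corrections for small and large villages.
separate_factor = {k: true_pop[k] / census[k] for k in census}
separate_adjusted = {k: census[k] * separate_factor[k] for k in census}

for k in true_pop:
    print(
        k,
        "true:", true_pop[k],
        "common correction:", round(common_adjusted[k]),
        "separate correction:", round(separate_adjusted[k]),
    )
print("total with common correction:", round(sum(common_adjusted.values())))
```

With the common correction the small villages remain underestimated and the large villages are overestimated, although the overall total is correct; the separate corrections remove the distortion.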



CHAPTER 4 

PRACTICAL PROBLEMS ARISING IN THE PLANNING OF A SURVEY

4.1 Questions requiring consideration 

The practical problems encountered in the planning of a sampling 
investigation vary greatly with the type of material and the nature of the 
information that is required. We shall here only concern ourselves with 
problems arising in the conduct of censuses and surveys, such as are required 
in the study of human populations, economic institutions, and agriculture. 
The sampling of batches of material, industrial products, etc., which is 
necessary in manufacturing processes of all kinds, and is broadly categorized 
by the term quality control, presents rather different problems which are not 
discussed in this book. The sampling problems encountered in biological 
research are also omitted from the discussion. 

The questions that require consideration at the planning stage of censuses 
and surveys may be broadly classified as follows : 

(1) Specification of the purposes of the survey. 

(2) Definition of the population, types of institution, or categories of 
material to be covered by the survey. 

(3) Decision on the nature of the information to be collected. 

(4) Decisions on the method of collecting the data, whether by interviewers, 
investigators, mail, etc., and methods of dealing with non-response. 

(5) Choice of frame, or construction of a frame if none is available. 

(6) Choice of sampling unit and type of sample, whether stratified, multi- 
stage, etc., determination of size of sample required, and method of 
selection. 

(7) Decision on whether the survey is to be an isolated one, undertaken 
without intention of repetition, or is to be planned with a view to 
repetition at intervals. 

These questions cannot be considered in isolation one from another. To 
a greater or less extent any decision taken on one question will influence the 
decisions that should be taken on the others. They should therefore be resolved 
jointly, or if independent decisions are made these should at least be regarded 
as tentative and subject to modification until the plan as a whole has been 
finalized. 

Nor can the correct decisions be arrived at without considerable knowledge 
of the nature of the material to be covered, particularly its variability, both as 
a whole and within and between strata of various types. Knowledge is also 
required of the ways in which it is practicable to collect the required information 
with the necessary accuracy. If prior knowledge in these matters is not 
available a pilot or exploratory survey will be necessary. Even if there is 
adequate knowledge of the statistical properties of the material, pilot surveys
are frequently advisable in large-scale surveys in order to test and improve 
field procedure and schedules, and to train field workers. 

Questions arising under heads (1), (2), (3), (4) and (7) of the above list 
are common both to complete censuses and surveys and to sample censuses 
and surveys. Even here, however, the problems encountered differ considerably 
in the two cases, owing to the greater scope for the collection of detailed 
information and the execution of complicated observations by the sampling 
method. 

The determination of the items on which information is to be collected, 
the degree of detail to be attempted, and the ways in which the information 
can best be obtained, often constitute the most difficult and crucial part of the 
planning of a survey. No amount of care in the planning of the sampling 
or skill in the analysis will compensate for failure in this respect. A survey 
in which the information collected does not adequately cover the field to be 
investigated at the best provides a partial and incomplete picture, and at 
the worst may be irrelevant or actively misleading. 

Careful consideration must therefore be given at the outset to the purposes 
for which the survey is to be undertaken, the type of information it is proposed 
to collect, and the uses to which the information obtained will be put. In 
the case of large-scale surveys, which are likely to provide information that 
will be of value to a number of different organizations or government
departments, a detailed statement on these points should be prepared. In this way
those who are likely to want to make use of the results of the survey will be fully 
apprised of its nature, and can if necessary make suggestions for modifications 
before the survey is begun. 

The statistician who will ultimately be responsible for the analysis and 
presentation of the results should, if possible, be selected and appointed at 
the planning stage. Similarly if the advice of a statistical expert is to be sought, 
this should be done, in the first instance, at the planning stage. This rule 
applies even in the simplest types of census. It frequently happens that such 
censuses are undertaken without any prior consultation with a statistical expert, 
whose advice is only sought when the results have been collected and the 
stage of analysis is reached. 

4.2 Definition of the population 

The categories, or types of material, which require to be included in a 
survey, and its geographical scope, are conditioned in broad outline by the 
purposes of the survey, and by administrative and research requirements 
related to these purposes. Within this broad outline, however, there is often 
a certain amount of latitude, and careful consideration should therefore be 
given to the inclusion or omission of marginal categories, particularly those on 
which the collection of information is likely to be specially difficult, or for which 
an adequate frame is lacking. By excluding unimportant marginal categories 
the task of collecting the information may often be very materially simplified,
without seriously reducing the value of the results. 

A census of the human population residing in a given territory, for example, 
should ideally include all individuals present in that territory at a particular 
moment, and in simple censuses an attempt is usually made to attain this 
end. It is often, however, difficult to obtain information on certain minor 
categories, such as nomads. These difficulties occur even in a complete census, 
but are often more marked in the case of a sample census. The question of 
whether such categories may be omitted entirely without serious loss should 
therefore be considered. 

The matter becomes of even greater importance when a human population 
census, requiring the collection of detailed and complicated information, is
undertaken, using skilled investigators making visits to individual members 
of the population. In such cases visits to members of the population with a 
permanent residence, even if they are absent from their residence at certain 
times, are relatively simple, but it is far more difficult to cover the floating 
elements of the population. The conduct of such a census becomes very much 
simpler, therefore, if these latter elements can be omitted.

In a similar manner, in the case of an agricultural census, the determination 
of the areas of the various crops might ideally require that all areas of the crops 
grown within the boundaries of the territory should be included. It may, 
however, be possible to exclude small areas, such as those found in gardens
and holdings of very small size, without seriously reducing the value of the 
information. The agricultural censuses of England and Wales, for example, 
which are based on returns from farmers, exclude all agricultural holdings of 
less than one acre, and do not attempt to take account of crops grown in 
private gardens or allotments.

The question of whether or not minor categories should be included 
depends mainly on the purposes for which the information is required. A 
case is sometimes made for the inclusion of certain categories on which the 
information is intrinsically of little interest in order to ensure comparability 
with the results of previous censuses or surveys, or with the results of parallel 
surveys in other countries. Comparability within and between statistical 
series is obviously desirable, and lack of it can seriously reduce the value of 
the results, and also increase the labour of statistical analyses and the danger 
that those unfamiliar with the details of the various sources of information 
may draw wrong conclusions. Nevertheless when introducing a radically new 
method of collecting information, such as replacement of a complete census 
by the sampling method, excessive weight should not be given to past practice. 
It should not be forgotten that so-called complete censuses are often in 
themselves subject to errors of various kinds, including lack of completeness, 
and that such errors are often a greater source of disturbance to comparability 
than the omission or inclusion of a few minor categories. If there is any serious 
doubt whether a given category should or should not be included this may be 
regarded as prima facie evidence that the category in question differs in
essentials from the other more important categories. Consequently, if it is
decided that the category should be included, the results should be kept separate
so that they can be summarized separately and eliminated from the final
estimates if required, or given special treatment in these estimates. If this 
is done for the first one or two surveys of the new type, comparability with 
previous results will be ensured, without preventing the omission of the 
category in subsequent surveys if this ultimately appears desirable. 

The arguments in favour of the adoption of identical definitions in different 
countries in which conditions are radically different are even less strong. 
Categories which are of very minor importance in one country may be of 
great importance in another. Decisions as to their inclusion or omission should 
be taken primarily on the grounds of their importance in the country which is 
being surveyed, without undue regard to definitions designed to ensure formal 
uniformity of world statistics. 

In many cases in which complete omission of unimportant categories 
would not be justified, they can be very conveniently dealt with by some 
special sampling procedure, which may be multi-phase, or may be of an 
entirely different type with different frame and sampling units. Thus in a 
human census, certain of the simpler items of information, which can be 
reliably furnished by neighbours or other members of the household, may be 
collected for absentees abroad, or a sub-sample of these absentees may be taken 
for a follow-up enquiry by more intensive methods. Nomads may be dealt 
with by instituting a supplementary sample census to deal only with this 
category of the population. 

In a sample survey the frame adopted contains its own implicit definitions 
of the categories of material to be covered. If a category is not included in 
the frame it will either have to be omitted entirely or special steps will have to 
be taken to supplement the frame. Definitions of the population should therefore 
be considered in conjunction with the choice of frame. 

4.3 Determination of the details of the information to be collected 

The detailed problems which arise in deciding what information is necessary 
and how it can best be obtained vary widely in surveys covering different 
fields of enquiry and according to whether the results are required primarily 
for administrative or for research purposes. Full discussion of any particular 
case necessarily requires extensive knowledge of the subject as a whole and 
of the particular questions at issue, and would be out of place here, but there 
are certain general points which may be mentioned. 

The basic problem is essentially that of the selection of the most relevant 
items of information or types of observation from all those which it is practicable 
to collect and which might conceivably have a bearing on the matters under 
investigation. This selection must be such that a coherent whole is obtained 
which covers the required field adequately, or if this is not possible at least 
provides information on some relevant part of it. 

This basic problem is essentially the same in complete censuses and sample
censuses and surveys, but the problem is more complex in the case of sample
censuses and surveys, since the items of information that can be collected
and the observations that can be made are themselves more complex and
varied.

The best way of arriving at a satisfactory solution of this basic problem 
is usually as follows. In the first instance, the details of the information 
required to deal with the problems originally propounded are determined. 
The question is then considered whether there are any related problems of 
importance on which this information, possibly supplemented to some 
extent, would throw light. If this is the case the supplementary items of 
information required for the full elucidation of these additional problems should 
be determined. With the whole field mapped out in this way, the practicability 
of obtaining the necessary items of information covering any given set of 
problems can be considered, and final decisions taken in the light of the 
relative importance of these problems and the total load which it is considered 
expedient to place on the investigators and respondents in a single survey. 

The details of this process vary greatly in different types of survey, but 
the general principle to follow in all types of survey is to see that the items 
of information collected form a rounded whole covering a definite subject or 
coherent group of subjects. 

This principle is of particular importance in surveys of the questionnaire 
type on human populations, whether the questionnaires are filled in by the 
respondents themselves, or the information is elicited by field investigators. 
Accurate information can only be obtained in such surveys if full and willing 
co-operation of those providing the information is obtained. The survey must 
therefore have a clear purpose which can be explained to the respondents, 
and the questions asked must be relevant to this purpose. If additional 
questions dealing with unrelated subjects are included, or if the questions 
relating to the main enquiry seem trivial, and do not cover aspects which 
appear of importance to those providing the information, the survey will cease 
to appear as a serious enquiry into a particular subject, and will meet with 
unfavourable reactions, summed up in such terms as " snooping." 

The matter is of importance even in enquiries which require the collection 
of factual information by observation and measurement by the investigators 
themselves, without any co-operation from respondents. If the field 
investigators are not imbued with a sense of the importance of their enquiry, 
and are overloaded with the collection of miscellaneous data, they will not 
give of their best. Occasionally information may be sought on points
unconnected with the main survey if it is urgently needed, and considerable 
expense is thereby saved, e.g. in travelling, but this should be avoided as far 
as possible. 

Occasionally, in cases in which a questionnaire would otherwise be unduly 
long it may be possible to split it into parts, obtaining information on one 
group of items from one set of respondents, and on another group of items 
from a second set. Certain basic items of information will be required from
all respondents, and the two sets will form a pair of interpenetrating samples. 
The sampling has also a two-phase structure, the basic information acting as 
first-phase information for both sets. If this procedure is followed, however, 
the relationships between items of information in the two groups can only 
be studied for strata or other suitable domains of study, and not for the 
individual respondents. 

Certain items of information are often required in order to ensure the 
proper interpretation of other items. Thus, for example, if housewives are 
being asked whether they prefer coal, gas or electric cooking, and the reasons 
for their preferences, it is essential to ascertain in some detail what experience 
they have had of methods of cooking other than the one they are now using, 
including the type of apparatus used. If this is not done the answers may be 
more an indication of the effectiveness of an advertising campaign in favour 
of one of the methods, or a condemnation of antiquated pieces of apparatus, 
rather than any reflection of the true relative merits of the different methods. 

Information is also often required on items which, though not of primary 
interest, will act as supplementary information and thereby enable the precision 
of the results to be increased by the appropriate methods of estimation. 

In reaching the decisions on the type of information required, both in broad 
outline and in detail, it is absolutely essential to work in collaboration with 
experts on the subjects which it is proposed to cover. If research or 
administrative experience in the subjects to be covered is lacking, it is fatally 
easy when designing a survey to omit some vital items of information. A simple 
instance of such omission is provided by the 1921 and 1931 Population Censuses 
of the United Kingdom. In these censuses information on age of mother at 
marriage, and total number of children born, which had been obtained in the 
1911 Census, was not asked, with the result that the value of the information 
provided by these censuses for studies on changes of fertility of the population 
has been very seriously reduced. As a result of this lack of information it was 
felt to be necessary to institute a special Family Census in 1946 (Section 4.10).
In this instance it can scarcely be that the need for this information was wholly 
overlooked, but insufficient weight must clearly have been given to this aspect 
of census information. 

In addition to direct collaboration with experts in the various subjects, 
the plans for the survey should be circulated at all stages of development to 
the various organizations and individuals who are likely to be interested in
the results. This will usually result in requests for the collection of 
supplementary items of information, some of which may not be necessary for 
the purpose for which the survey was originally planned but which will enable 
the results to be used for other purposes. In this way the usefulness of the 
survey may often be considerably increased. On the other hand, the danger of 
overloading the survey with the collection of miscellaneous items of information 
must be guarded against, and all requests should therefore be very carefully 
reviewed.

4.4 Inter-relations of groups of natural units

If the physical inter-relations between the members of groups of natural
units of the material under survey are of interest, or if information is required 
for groups of natural units as a whole, then information must be collected 
for such groups as a whole, or at least for pairs of units from such groups. 
Thus if the inter-relations between the different members of a household
require to be studied, it is essential to have information for pairs of individuals 
belonging to the same household, and it is usually best that the information 
should cover the whole of a household. This can be ensured by using 
households or dwellings as sampling units. 

Another type of natural aggregate for which it is often important to obtain 
results as a whole is that provided by towns, villages, etc., and, in agricultural 
surveys, homogeneous geographical areas. This often calls for the adoption 
of multi-stage sampling, the natural aggregates forming the sampling units
at the first stage, even in cases where the use of single-stage sampling is 
otherwise preferable. Thus in a survey of a human population it may be 
of considerable interest to contrast the results for individual towns of differing 
types, and to study the inter-relations existing within a single town, even 
when there is no need for all the towns of the country to be covered. 

Similarly, if inter-relations between the behaviour of the same individuals 
or other natural units at different times are of interest, the survey must be 
designed so as to provide information covering an adequate period of time. 
Thus in an investigation into hours of sleep of children, it is of little value to 
determine the amount of sleep of a sample of children each for a single day. 
Such data will throw no light on the question of whether children who have 
a short period of sleep on a particular day tend also to go short of sleep on 
other days or are able to make up for this short period by longer periods on 
preceding or following days. In the same way, studies of nutrition in which 
the intake of food is determined for each individual for a single day only, 
although they will show whether a group as a whole is under-nourished, are 
incapable of revealing the degree of variation in under-nourishment between 
individual and individual, since individuals going short of food on a particular 
day may make up for such deficiencies, in whole or in part, on succeeding 
days. 
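
The point may be illustrated by a simulation. The sketch below is purely illustrative: the intake figures, the two standard deviations and the function name are assumptions, not survey results. Each individual is given a habitual level of intake about which the daily intake fluctuates, and the spread of single-day records is compared with the spread of seven-day averages.

```python
import random
import statistics


def simulate_intakes(n_people=2000, n_days=7, seed=3):
    """Daily intakes (hypothetical units) for each individual over n_days."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_people):
        habitual = rng.gauss(2500, 200)  # variation between individuals (assumed)
        records.append([rng.gauss(habitual, 400) for _ in range(n_days)])  # day to day
    return records


records = simulate_intakes()
n_days = len(records[0])

# Spread between individuals as it appears from a single day's record.
single_day_sd = statistics.pstdev([days[0] for days in records])

# Spread between individuals as it appears from seven-day averages.
weekly_sd = statistics.pstdev([statistics.mean(days) for days in records])

# With several days per individual the day-to-day component can be separated.
within_var = statistics.mean([statistics.variance(days) for days in records])
between_sd_est = (weekly_sd ** 2 - within_var / n_days) ** 0.5

print("apparent spread, single day:", round(single_day_sd))
print("apparent spread, seven-day mean:", round(weekly_sd))
print("estimated spread of habitual intake:", round(between_sd_est))
```

The single-day records greatly overstate the variation between individuals, and only records covering several days for the same individuals allow the day-to-day fluctuation to be separated from the true differences between individuals.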

We have stressed this point at some length because there has been a tendency, 
in surveys on human populations of the questionnaire type, to take the relatively 
easy course of asking those interviewed about occurrences which are still fresh 
in their mind, e.g. what happened on the previous day. This course is followed 
for various reasons. It may be considered that information provided about 
earlier occurrences will be inaccurate, or that there is a danger of overburdening 
respondents if an attempt is made to cover too long a period ; or the object 
may be to save interviewers the trouble of repeated visits which might be 
required to cover a period of time accurately. Actually it has been found that 
the use of a very short period does not necessarily lead to accurate average 
results : in certain circumstances there may be a tendency on the part of 
respondents, either consciously or unconsciously, to telescope events, and 
report them as happening in the given period when in fact they happened 
earlier. Thus a survey of crockery breakages made by asking what breakages 
occurred over the past week led to an entirely excessive estimate of the amount 
of breakage, whereas a similar survey asking for breakages over the past year 
gave results which checked well with production figures and the domestic 
stocks (Box and Thomas, 1944, D, discussion). 

4.5 Practicability of obtaining the required information 

So far we have been considering the problem of determining what items 
of information are required in order that the purposes of the survey may be 
fulfilled. Each item must, however, be considered in the light of the 
practicability of obtaining it. If the information is to be furnished in response 
to questions, the points for consideration are whether the respondents are 
sufficiently informed to be capable of giving accurate answers ; whether, if 
the provision of accurate answers involves them in a good deal of work, such 
as consulting previous records, they will be prepared to undertake this work ; 
whether they have motives for concealing the truth, and if so whether they 
will merely refuse to answer, or will give incorrect replies. If the information 
is to be obtained by observation or physical measurement, the points for 
consideration are whether the observations are such that they are within the 
competence of the investigators or other individuals who will be required to 
undertake them ; whether they will make excessive demands on the time of 
the investigators or others, or require excessively expensive apparatus ; and 
whether the owners of the surveyed material will permit the observations to 
be made. 

Considerations of this kind will inevitably lead to modifications of what 
would otherwise be considered an ideal scheme. Nor can general answers be 
given, even within the limits of a particular field of enquiry. In countries 
such as the United Kingdom, for example, there is no reason to suppose that 
any large amount of inaccuracy is introduced into the returns of the population 
censuses by deliberate mis-statements. In countries not accustomed to 
population censuses, fear that the information will be used for such purposes 
as taxation or conscription may lead to considerable inaccuracies. Similarly 
in crop-sampling work the use of small sample areas may be quite satisfactory 
with certain classes of field worker, but, as is shown by Table 2.5, is entirely 
unsatisfactory in other cases. 

When the ideal requirements cannot be fully met it is sometimes possible 
to include other items of information, observations, or physical measurements, 
which, owing to their high correlation with the quantities which it is desired 
to determine, will serve as more or less adequate substitutes for these quantities. 
These substitute measures may be used for purposes of stratification or 
classification of the data in the final analysis, as for example when the rateable 
value of a dwelling is used as a substitute for the income of the household 
occupying it ; or they may be substitutes for measures of quantities which 
themselves require assessment, as for example the use of eye estimates in place 
of direct measurements of the yields of a standing crop. The efficiency of such 
substitute measures can be properly judged only by a statistical 
investigation of the relations between them and the quantities for which they 
are substitutes. In the case of substitutes for measures of quantities that are 
to be assessed, some method of calibration is essential if objective estimates 
of the original quantities are to be obtained. (The calibration of eye estimates 
is discussed in Sections 6.15 and 7.14.) 

It will inevitably happen in certain cases that information which is of 
considerable importance will prove to be unobtainable, or unobtainable with 
sufficient accuracy. When such a situation arises it must be squarely faced. 
There is at times a tendency to attempt to collect information which, because 
of its nature, cannot be obtained with the necessary accuracy, and then 
to condemn the survey method in general because the results are of little 
value. 

This, however, does not mean that the collection of difficult items of 
information should not be attempted. The sample survey procedure, because 
it makes possible the use of skilled investigators working on a relatively small 
sample, is frequently capable of eliciting reliable information on points which 
it would be quite impossible to include in a general enquiry. The fact that 
the enquiry is on a small sample, if known to the respondents, frequently 
makes them willing to give information which they would certainly not be 
prepared to give if the enquiry were general. In such cases it is important 
that the investigators should themselves be recognized as impartial and 
disinterested ; in particular they should not be officials of an organization 
which itself might make use of the information obtained to the detriment 
of the respondents. 

Nevertheless there are subjects on which it is impossible to collect accurate 
information from a random sample of the population. In certain of these 
cases information can be collected from a selected group of individuals, 
e.g. individuals with whom social welfare workers are in contact. Information 
of this type is not necessarily valueless, but it must be clearly recognized that 
it is not the equivalent of information obtained from a random sample of the 
whole population, and any attempted generalization of the results will be of 
limited validity. 

Attempts are sometimes made to obtain a sample from such a group of 
individuals which conforms more closely in certain respects to the population, 
e.g. in classification by age or social class, than does the group as a whole. 
While this may improve the sample somewhat, it still does not provide the 
equivalent of a random sample. On the other hand, if the whole of the group 
is not required, it is usually advisable to apply some rigorous form of selection 
rather than to permit the workers themselves to select individuals for 
investigation, as the latter procedure will merely introduce further unnecessary 
elements of bias. 

In cases in which some of the items of information are difficult to collect, 
multi-phase sampling may be of value. It may, for instance, enable specially 
skilled investigators to be used for the more difficult items. Thus in a health 
survey medically qualified investigators may be used on a small sub-sample 
of a much larger sample on which more general items of information relating 
to health have been collected. Equally it may be used to reduce the work 
required to manageable proportions. Thus, in the Survey of Fertilizer Practice 
soil samples for chemical analysis were taken from one old-arable field, one 
new-arable field, and one field of permanent grass on each farm, these fields 
being a sub-sample of all the fields on which information on the use of 
fertilizers was obtained (Section 4.23). 

4.6 Methods of collecting the information 

The methods of collecting the information are to a large extent conditioned 
by the material under survey and the type of information required. Where 
the alternative possibilities exist, it may be stated as a general rule that 
observations are preferable to questions, and questions on facts and on past 
actions are preferable to questions on generalities and on hypothetical future 
conduct. Thus it is better to inspect a house to see if it shows signs of damp, 
than to ask the occupant if the house is damp ; and it is better to find out 
what considerations, from among the various alternatives (if any) that presented 
themselves, governed the selection of the house in which the occupant is living, 
rather than to ask what type of dwelling (house, flat, bungalow, etc.) is 
"preferred." 

On the other hand, it is scarcely possible to state any general rule with 
regard to physical measurements and qualitative observations made by the 
investigator. Physical measurements are more objective, but qualitative 
observations are often more capable of summing up the salient features of a 
complex situation. Thus a qualitative grading by the investigator of the degree 
of dampness of a house is likely to be more effective than any physical measure- 
ments designed to determine the degree of dampness. Moreover, by proper 
standardization and calibration among investigators qualitative observations 
can themselves be made objective. 

When the information is collected by means of a census form or question- 
naire the questions which are to be asked should be considered at the planning 
stage, since the information obtained will depend on the exact form of these 
questions. Equally the exact form of any observations and physical measure- 
ments which are required should be determined. 

Census forms and questionnaires may be designed either for completion 
by the respondent with little or no assistance from investigators, or for 
completion by the investigator with the aid of questions put to the respondents. 
In questionnaires of the latter type the investigators may be instructed to 
ask questions with a given form of wording, or they may be instructed to elicit 
information which will provide an answer to the questions of the questionnaire 
by enquiry and discussion without adherence to any exact form of words. 
Both means of eliciting information may be required in the same survey for 
different items of information. 

Census forms and questionnaires designed for completion by the respondent 
may be delivered and returned by post, delivered by post and collected by 
an enumerator or investigator, or vice versa, or delivered and collected by an 
investigator. Use of the post is clearly most economical, and is the method 
generally followed in censuses and surveys of industrial and commercial 
organizations, such as censuses of production. In such cases the use of investi- 
gators will not normally have any great advantage over the post, either in ensur- 
ing more complete response or obtaining more accurate information, though 
occasionally in local surveys investigators may be used to explain the purposes 
of the survey and persuade the respondents to co-operate. In population 
censuses, however, investigators are normally used both in order to ensure 
the maximum response, and to give assistance where necessary in filling up 
the forms. Censuses and surveys of small-scale industrial and commercial 
organizations, and of farms, occupy an intermediate position, and the method 
used will depend to a large extent on local circumstances. 

Attention must be paid to the detailed wording of all questions, even if 
these are only intended as guides to the investigator. If the question itself 
creates a wrong impression in the mind of the investigator this will undoubtedly 
lead to errors, even if additional explanatory notes indicate that something 
else is really required. 

Careful thought must also be given to the order of the questions. If questions 
are arranged in an orderly sequence the investigator's task is much easier, and 
the respondent's reaction is likely to be more favourable. This applies to all 
forms of questionnaire, but is most important in the verbal questionnaire. 

In many types of survey it is profitable to give the investigator or respondent 
an opportunity of making general remarks on special points. This can be done 
very simply by including a space for observations. Some guidance should 
be given on the type of observations required. Although such observations 
do not easily lend themselves to exact analysis they are frequently of considerable 
value in drawing attention to relevant facts not covered by the questionnaire itself. 

The type of investigator to be employed must also be considered. 
Investigators should have a background knowledge of the subject under 
investigation, particularly in investigations of the research type. In a technical 
investigation into housing conditions, for instance, the investigators should 
have some knowledge of housing construction and of standards normally 
adopted in good practice. This requirement of technical knowledge in the 
investigators limits the scope of unspecialized teams of investigators. Such 
teams are suitable for carrying out ad hoc and routine investigations which 
require only relatively simple questionnaires, but they are no substitute for the 
more specialized teams required for investigations of a research nature involving 
technical questionnaires and observations or physical measurements by the 
investigators themselves. 

In surveys requiring any high degree of technical knowledge it is usually 
best either to use members of existing organizations, or to appoint a small 
specialized team of technically qualified research investigators. The various 
surveys into the technical and economic aspects of agricultural practice in 
England and Wales, for example, are carried out by the staffs of the National 
Agricultural Advisory Service and the Provincial Advisory Economists. By 
this means teams of investigators are obtained who are technically qualified 
and capable of discussing the problems involved with the farmers ; at the 
same time the investigators themselves gain a wider knowledge of the farms 
of their district which is of value to them in their other work. 

4.7 Methods of dealing with non-response 

Unless non-response is confined to a small proportion of the whole sample 
the results cannot claim any general validity. Every effort must therefore 
be made to reduce non-response to negligible proportions. 

Non-response is usually most serious in postal questionnaires. Delays 
in response can also sometimes be very troublesome, particularly when the 
results are required quickly. A rigorous system of dealing with failure to respond 
and delay in response must therefore be instituted at the outset. The first 
step is to send a follow-up letter, but if this does not produce the required 
effect, the possibility of using more intensive methods such as telephone calls 
and personal visits must be considered. These will require a special regional 
organization. 

In censuses of industrial and commercial undertakings in which data on 
production, sales, labour force, etc. are required for the purposes of economic 
planning it is usually possible to make the returns compulsory. This is often 
a help in dealing with a small minority of recalcitrant institutions, particularly 
if pressure can be brought to bear in other ways, but it is no substitute for full 
and willing co-operation by the majority. Complete population censuses 
are usually also made compulsory, and there appears to be no logical reason 
why sample censuses of the same type should not also be compulsory. While 
this is little help in dealing with obstinate refusals, since the census authorities 
are not likely to wish to bring the offenders before the courts, it is an indication 
that the government regard the census as of importance, and to this extent 
is likely to act as a persuasive force with the waverers. 

In censuses which are to be repeated at intervals it is particularly important 
to deal vigorously with non-response and delay in response at the outset, as 
otherwise they tend to increase progressively. If any large volume of non- 
response persists, or if there is any serious delay in making the returns, it is 
an indication that something is wrong with the census, which should either 
be reorganized or abandoned. 

In sociological surveys using the interview method the amount of deliberate 
non-response is usually small. If it is not, the questionnaire and the type of 
investigator used should be reviewed. Revisits by special investigators 
can be tried, but are not likely to be very effective. In technical surveys of 
agriculture involving interviews with the farmers the amount of deliberate 
non-response is also usually small, unless the amount of information required 
is such that it puts too heavy a burden on the respondents. 

In sociological surveys, however, initial non-response due to failure to 
contact the respondent can be very troublesome. There is no proper way 
of dealing with this except by persistent call-backs. The number of call-backs 
can often be reduced by enquiring of neighbours when the respondent is likely 
to be at home, or where he can be found so that an interview can be arranged. 
Call-backs are also required because the respondent, though willing to give 
the information, is otherwise engaged at the time of the first call. 

The amount of work involved in follow-ups and call-backs can be reduced, 
if this appears desirable, by taking a sub-sample of those not contacted at the 
first (or subsequent) call, and weighting up the sub-sample in the final results. 
In repeated censuses, however, complete follow-ups are advantageous in 
encouraging better response to later censuses. 
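
The weighting-up of a sub-sample of those not contacted at the first call may be illustrated by a short numerical sketch. The figures and the function name are hypothetical, chosen only for illustration; the sketch assumes that the sub-sample is drawn at random from the units missed at the first call and that all sub-sampled units are eventually interviewed.

```python
def weighted_mean_with_callback_subsample(first_call_values,
                                          n_not_contacted,
                                          subsample_values):
    """Estimate the sample mean when only a sub-sample of the units
    missed at the first call is followed up.

    Each followed-up unit stands for
    n_not_contacted / len(subsample_values) of the missed units.
    """
    if n_not_contacted > 0 and not subsample_values:
        raise ValueError("a sub-sample is needed when units were missed")
    total = sum(first_call_values)
    if n_not_contacted > 0:
        weight = n_not_contacted / len(subsample_values)
        total += weight * sum(subsample_values)
    return total / (len(first_call_values) + n_not_contacted)


# Hypothetical figures: 6 units interviewed at the first call,
# 4 not contacted, of which 2 are followed up by call-backs.
first_call = [3, 4, 5, 4, 3, 5]
followed_up = [8, 6]
estimate = weighted_mean_with_callback_subsample(first_call, 4, followed_up)
print(estimate)
```

In these figures each followed-up unit carries a weight of 2. Averaging the eight interviewed units without weighting would under-represent the units that were hard to contact.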

4.8 The frame 

The whole structure of a sampling survey is to a considerable extent 
determined by the frame. The methods of survey which are suitable for a 
given type of material may be radically different in different territories because 
different types of frame have to be used. Consequently, until particulars 
of the nature and accuracy of the available frames have been obtained, no 
detailed planning of the survey can be undertaken. If no frame exists, the 
construction of a frame suitable for the purposes of the survey may well constitute 
a major part of the work of the survey. 

Frames are subject to various types of defect, which may be broadly 
classified as follows. A frame may be : 

(1) Inaccurate. 

(2) Incomplete. 

(3) Subject to duplication. 

(4) Inadequate. 

(5) Out of date. 

A frame may be termed inaccurate if information about the units listed 
in it or defined by it is inaccurate. The term may also be used to cover the 
listing of units which do not in fact exist. Thus a ration-card list in which 
certain women were incorrectly described as married when they were in fact 
single, or in which certain individuals were included who had died, would be 
inaccurate in these respects. 

A frame may be said to be incomplete when certain units of the material are 
omitted entirely, and be subject to duplication when certain units of the material 
are included more than once. Thus a ration-card list in which certain individuals 
were not included, and others were included twice, would be both incomplete 
and subject to duplication. 

A frame may be termed inadequate when it does not cover all the categories 
of the material which it is desired to include in the survey. Thus a ration-card 
list which did not include the temporary residents in a district would be 
inadequate for a survey of the population of that district in which it was 
necessary to include such temporary residents. 

A frame, though accurate, complete, and free from duplication at the time 
it was constructed, may no longer be so at the time it is required for use. Such 
frames may be said to be out of date. Errors of all the first three of the above 
types may be introduced through the frame being out of date. 

These different types of defect have very different consequences in the 
defects they introduce into the sampling process. Inaccuracy in the frame, 
in so far as it relates to the selected sampling units, will automatically be 
discovered and corrected as the survey progresses, and consequently will not 
invalidate the results. If the information contained in the frame has been used 
as a basis of stratification, etc. or as supplementary information, inaccuracy 
in this information will result in somewhat lower accuracy in the results, but 
the actual accuracy attained will be assessable from the results themselves. 

Incompleteness in the frame will not be discovered in the course of the 
survey itself, and to the extent to which a frame is incomplete the population 
or material will fail to be covered. Incompleteness is likely to be more serious 
than it appears to be at first sight, since it is often confined to units possessing 
some special characteristics, which may in consequence be seriously under- 
represented in the sample. Duplication has a similar effect, since the dupli- 
cated units will have a double chance of being included in the sample. There 
is the difference, however, that incompleteness cannot be determined or set 
right by an examination of the frame itself, whereas duplication may under 
certain circumstances be detected and corrected by such examination, though 
this will almost always be a tedious operation. If the sampling fraction is 
large and the degree of duplication is also large, the duplication may come 
to light in the course of the survey. Thus, with 5 per cent duplication and a 
sampling fraction of 1 in 10, two out of every 210 units in the sample will on 
the average constitute a duplicate pair. With a sampling fraction of 1 in 100, 
however, only two out of every 2100 units in the sample will constitute a 
duplicate pair. 
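
The arithmetic behind these figures can be checked with a short calculation. The sketch below assumes that every entry in the frame is selected independently with probability equal to the sampling fraction; the function name is hypothetical. With systematic selection from an ordered list the chance of picking up both members of a pair could differ.

```python
from fractions import Fraction


def duplicate_share_in_sample(duplication_rate, sampling_fraction):
    """Expected proportion of sampled frame entries that belong to a
    duplicate pair of which both members fall in the sample.

    duplication_rate  : proportion of distinct units listed twice
    sampling_fraction : chance that any one frame entry is selected
                        (entries assumed selected independently)
    """
    d = Fraction(duplication_rate)
    f = Fraction(sampling_fraction)
    # Per distinct unit the frame holds 1 + d entries on the average.
    expected_sample = f * (1 + d)
    # Both entries of a duplicated unit are drawn with chance f * f,
    # and then contribute two entries to the sample.
    expected_in_pairs = 2 * d * f * f
    return expected_in_pairs / expected_sample


share_1_in_10 = duplicate_share_in_sample(Fraction(5, 100), Fraction(1, 10))
share_1_in_100 = duplicate_share_in_sample(Fraction(5, 100), Fraction(1, 100))
print(share_1_in_10, share_1_in_100)
```

The two results reduce to 1/105 and 1/1050, that is, two units in 210 and two in 2100 respectively.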

A frame which is inaccurate for certain purposes may be incomplete for 
others. Thus a ration-card list in which some of the single women were 
described as married would be complete, though inaccurate, if used as a frame 
for a survey of all women, but would be incomplete if used as a frame for the 
survey of single women only. Such incompleteness could be remedied by 
taking a sample covering all women, and rejecting those members of the sample 
who were found on investigation to be married. 

Inadequacy of the frame will usually be known before the survey is under- 
taken from the specification of the frame itself. Inadequacy can in general 
only be dealt with by the construction of a subsidiary frame for the omitted 
categories. 

In actual practice, frames are likely to suffer to a greater or less extent 
from all of the above defects. It is therefore essential at the outset of the survey 
to carry out a careful investigation of any frame it is proposed to use, since 
many defects are not at all apparent until a detailed investigation has been 
made. Such an investigation will naturally commence with a study of the 
administrative machinery by which the frame has been constructed and by 
which it is kept up-to-date, but may also have to include a certain amount 
of field work. 

4.9 Frames suitable for censuses and surveys of human populations 

Human populations have a tendency to aggregate in towns and villages, 
often with very high local densities, which makes any form of area sampling 
based on maps and plans subject to high variability, unless a very elaborate 
sampling procedure is adopted. This is most serious if the total numbers 
are not known, and require to be estimated from the sample, but even the 
proportions of the population falling in different categories will be subject 
to substantial errors, since different classes of the population tend to be concen- 
trated in different areas. 

Three very different types of survey of human populations may be 
distinguished. These are : 

(1) Surveys of the census type, requiring the collection of relatively 
simple facts, but covering the whole population, and capable of giving 
separate results for small administrative areas. 

(2) Surveys covering the whole population of a country, and capable of 
giving reasonably accurate estimates for the whole population, and 
possibly for certain broad subdivisions, but not for small administrative 
areas. Such surveys often involve the collection of more detailed 
and elaborate information than do those of type (1). 

(3) Local surveys covering a particular town or rural area, or a few 
contrasted towns or rural areas, in which no attempt is made to obtain 
a sample which is fully representative of the country as a whole. Such 
surveys almost always involve the collection of detailed information 
by field investigators. They are usually investigations of a research 
nature, and may be precursors of simplified surveys on the same 
problems covering the whole country. 

Surveys of the first type present relatively simple sampling problems, 
and relatively complicated administrative problems. The sampling, since 
it has to cover small subdivisions of the population, must generally be single- 
stage, usually with stratification and a uniform sampling fraction. Surveys 
of the third type are also relatively simple ; since only limited areas have 
to be covered, a one- or two-stage sampling process usually suffices. 

Surveys of the second type, however, present much more difficult sampling 
problems, and also give much greater scope for increase in efficiency by the use 
of the more elaborate sampling methods. Since results are not required for 
small areas, administrative or other areas can form the first stage of a multi- 
stage process, thus enabling the sampling to be concentrated in relatively 
few areas instead of being spread over the whole country. This condition 
is absolutely necessary when field investigators are used. 

In fully developed areas a good deal of prior information on administrative 
areas is usually available. This often enables the accuracy of sampling at 
the first stage, which in general is the stage which contributes most to sampling 
error, to be substantially improved by the judicious use of stratification, 
supplementary information, etc. The sampling problems of this second type 
of survey are discussed in more detail in Section 4.18. 

Frames suitable for the sampling of human populations may be broadly 
classified as follows : 

(a) Lists of individuals in the population, or parts of it, provided for 
administrative purposes. 

(b) Aggregates of census returns resulting from a complete census. 

(c) Lists of households or dwellings in given areas. 

(d) Town plans. 

(e) Maps of the rural areas. 

(f) Lists of towns, villages, and administrative areas, often with 
supplementary information of various types. 

In the following sections we will give a brief description of the out- 
standing features, from a sampling point of view, of these various types of 
frame. 

4.10 Frames from lists of individuals 

Lists and card indexes of individuals are provided by various adminis- 
trative activities, such as registration of the population, rationing, or even 
a recent census. Such lists are only likely to be complete, accurate, and up- 
to-date if the administrative machinery is very efficient, and there is some 
definite administrative need for the lists to be kept under constant revision. 
Most lists of this type cover the whole of a country on a more or less uniform 
basis, but they are necessarily maintained by local offices, and their accuracy 
may consequently vary in different parts of the country. Even when a list 
is sufficiently accurate to fulfil adequately the administrative purposes for which 
it was designed, it will often be found to have unsuspected defects which make 
it unsuitable as a sampling frame. 

Lists of individuals are not suitable as a frame for sampling households, 
unless the individuals are grouped by households. If addresses are selected 
from a list not so grouped, the probability of selection will be proportional to 
the size of the household, which is rarely what is required. If such a method 
of selection is for any reason used, the results must be weighted in inverse 
ratio to the number of people in the sampled households included in the list 
(Section 6.16). 
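
The effect of such weighting may be seen from a small sketch. It is an illustration only, with hypothetical figures and function names, and uses a simple weighted average of household values; the treatment in Section 6.16 may differ in detail.

```python
def household_mean_from_individual_list(sampled_households):
    """Weighted mean of a household characteristic when households were
    reached by selecting individuals from an ungrouped list.

    sampled_households : list of (value, listed_members) pairs, where
    listed_members is the number of the household's members on the list.
    Each household is weighted by 1 / listed_members, since its chance
    of selection was proportional to listed_members.
    """
    weighted_total = 0.0
    weight_sum = 0.0
    for value, listed_members in sampled_households:
        if listed_members <= 0:
            raise ValueError("a sampled household must have a listed member")
        weight = 1.0 / listed_members
        weighted_total += weight * value
        weight_sum += weight
    return weighted_total / weight_sum


# Hypothetical sample: (household value, number of members on the list)
sample = [(10, 1), (20, 2), (40, 4)]
weighted = household_mean_from_individual_list(sample)
unweighted = sum(value for value, _ in sample) / len(sample)
print(weighted, unweighted)
```

In this example the unweighted mean is the larger of the two, because the large households, which here have the large values, are over-represented when selection is made through individuals.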

Examples of administrative lists are provided by the National Registration 
and ration-book lists maintained in the United Kingdom. The National 
Register, which was instituted in 1939, has probably always constituted a 
reasonably accurate register of the whole population, but in its early stages, 
at any rate, it was very defective as a local register, owing to the failure of 
individuals to register changes of address. This defect was later rectified 
by establishment of joint offices with the food offices, and insistence that any 
applicant for a new ration book should first have his identity card amended. 
This, however, did not ensure immediate registration of local changes of address, 
since new counterfoils were only required if the removal necessitated change 
of shops. Consequently local changes were often only registered at the time 
of the regular yearly issue of ration books. 

The card index of the ration-book issues necessarily suffered from similar 
defects. Consequently neither of these registers formed a suitable frame for 
the sampling of small administrative areas, such as a single food-office district, 
particularly during the war when movements of population were frequent 
and considerable owing to air raids. On the other hand, they were and are 
capable of serving as a reasonably adequate frame for a sample census of the 
whole population. 

The food-office card index was used as the frame for the 1946 Family 
Census of the United Kingdom. This census was carried out by the Royal 
Commission on Population, with the object of providing, for married women, 
information on age, age at marriage, number and dates of birth of all children, 
and husband's occupation, information which had never previously been 
collected in full. A sample of 1 in 10 of all the married women (including 
those widowed or divorced) was taken, by examining every tenth card, and 
recording the name and address if the card was for a female adult with the 
prefix Mrs. or with no prefix. Unmarried women selected by this process 
were requested to mark the questionnaire "unmarried." Questionnaires 
were dispatched by post, and collected by subsequent visit. 
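
The selection rule just described can be set out as a short sketch. The record layout, the field names and the starting position are assumptions made for illustration; the source does not state how the cards were recorded or how the first card was chosen.

```python
def select_family_census_cards(cards, interval=10, start=0):
    """Examine every `interval`-th card and keep those for a female
    adult whose prefix is 'Mrs.' or is missing.

    `cards` is a list of dicts with the hypothetical keys
    'name', 'sex', 'adult' and 'prefix'.
    """
    selected = []
    for card in cards[start::interval]:
        if card["sex"] == "F" and card["adult"] and card["prefix"] in ("Mrs.", ""):
            selected.append(card)
    return selected


# Hypothetical index of 30 cards, mostly adult men.
cards = [{"name": f"card{i}", "sex": "M", "adult": True, "prefix": "Mr."}
         for i in range(30)]
cards[0].update(sex="F", prefix="Mrs.")   # examined and recorded
cards[5].update(sex="F", prefix="Miss")   # never examined
cards[20].update(sex="F", prefix="")      # examined and recorded: no prefix
selected = select_family_census_cards(cards)
print([card["name"] for card in selected])
```

Cards with no prefix are retained because the woman may be married; those who are not are identified at the questionnaire stage, as described above.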

Since there is necessarily a time lag in cancellation of the old food-office 
card on removal (this is effected by notification from the food office issuing 
the new counterfoils), special steps had to be taken to deal with removals. 
This was effected by fixing a "zero" date at an interval prior to the date when 
the sample was taken. The interval was chosen so as to be somewhat longer 
than the time taken for notification of change of address to be received at the 
old office. Thus virtually all duplicate cards corresponding to changes of address 
prior to the zero date would have been removed. Registrations effected after 
the zero date were excluded from the sample, and all cancellations bearing 
a date of re-registration subsequent to the zero date were sampled, the new 
address being recorded. It will be seen that by this procedure all individuals 
entered into the sampling frame once and once only, and the only individuals 
for whom incorrect addresses were recorded were those who had for some 
reason delayed their re-registration, or for whom notification of change of 
address had not yet been received by the old office. This procedure avoided 
all duplication except in the rare event of excessive delay in notification between 
offices, while obviating the necessity of any attempt to construct a fully 
up-to-date non-duplicated index. 
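
The zero-date rule can be sketched in a few lines of code. This is only an illustration under stated assumptions: the `Card` record, its field names, the offices and all the dates are hypothetical, and the sketch ignores the rare case of a cancellation not yet notified to the old office.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Card:
    person: str
    office: str
    registered: date                          # date this registration was effected
    reregistered_on: Optional[date] = None    # filled in only on a cancelled card


def in_sample_frame(card: Card, zero_date: date) -> bool:
    """Apply the zero-date rule to a single card, live or cancelled."""
    if card.registered > zero_date:
        return False                 # registrations after the zero date are excluded
    if card.reregistered_on is None:
        return True                  # live card registered on or before the zero date
    # Cancelled card: kept only if the re-registration falls after the zero date.
    return card.reregistered_on > zero_date


ZERO_DATE = date(1946, 1, 1)         # hypothetical zero date
cards = [
    Card("A", "old office", date(1944, 5, 1)),                                      # never moved
    Card("B", "old office", date(1943, 3, 1), reregistered_on=date(1945, 11, 1)),   # moved before zero date
    Card("B", "new office", date(1945, 11, 1)),
    Card("C", "old office", date(1942, 7, 1), reregistered_on=date(1946, 1, 20)),   # moved after zero date
    Card("C", "new office", date(1946, 1, 20)),
]

times_in_frame = {}
for card in cards:
    if in_sample_frame(card, ZERO_DATE):
        times_in_frame[card.person] = times_in_frame.get(card.person, 0) + 1

print(times_in_frame)
```

In this toy register each of the three persons enters the frame exactly once: B through the new card, C through the cancelled old card (with the new address recorded from it).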

4.11 Frames from complete population censuses 

A complete census, in so far as it really is complete, will automatically 
provide an aggregate of forms which includes all the individuals in the area 
covered by the census. Nevertheless complete censuses, although they would 
appear at first sight to provide very satisfactory frames, have a number of 
defects. A complete census by its very nature can only be carried out at in- 
frequent intervals, e.g. every ten years, and consequently the frame provided 
by such a census is for the greater part of its existence badly out of date. The 
way in which the census information is customarily collected and analysed 
also tends to reduce its utility as a frame, since the information is not readily 
accessible, at least in the early stages during which it is being transferred to 
punched cards. One of the great advantages of the food office register used 
in the Family Census was that the cards could be consulted and sampled 
in the local offices without serious disturbance of the office routine. 

Many of these disadvantages can be overcome if at the time a complete 
census is undertaken arrangements are made to construct a proper master 
sample from which further samples can be drawn as required. The sampling 
unit for such a master sample should be the dwelling, and not the individual 
or household occupying that dwelling at the time of the census. If dwellings 
are adopted as sampling units the master sample will have a much greater 
degree of permanence than would be the case if individuals were used as sampling 
units. Furthermore, for most purposes a sample of households, and not of 
individuals scattered over all households, is required. 

A complete census will provide a very suitable frame for a simultaneous 
sampling census in which more detailed information is collected on a sample 
of the population. This procedure was used in the 1940 census of population 
in the U.S.A. (Stephan et al., 1940, C). In this census supplementary questions 
were asked of 1 in 20 of the individuals included in the complete census, at the 
same time as the main census information was collected. The procedure was 
thus analogous to two-phase sampling, with the exception that the first phase 
constituted the whole population. 

Since the selection of 1 in 20 individuals for the collection of the supplemen- 
tary information was done on the spot by the field investigators, certain very 
rigorous rules had to be instituted in order to avoid bias. The actual procedure 
adopted was as follows. The census forms each contained lines for 80 individuals, 
40 lines on each side. Two of the lines on each side were specially marked. 
Five different types of form were used, with the marked lines distributed in 
the manner shown in Table 4.11. 

TABLE 4.11 SAMPLING LINE NUMBERS IN THE 1940 U.S. POPULATION CENSUS, 
AND THEIR PROPORTIONS 

Style   Proportion   Line numbers 
V           16       14   29   55   68 
W            1        1    5   41   75 
X            1        2    6   42   77 
Y            1        3   39   44   79 
Z            1        4   40   40   80 

The investigators were instructed to enter the names of each family in a 
defined order, and to complete all lines of the form before commencing a new 
form. Actually these instructions were not always adhered to, 31 per cent 
of the last lines (Nos. 40 and 80) being found to be blank. If the blanks extend 
over the earlier lines, which are not marked on the W-Z forms, but not as far 
as the lines marked on the V form, this will lead to a slight deficiency in the 
proportion in the sample of entries in line 1 and the other lines marked on 
the W-Z forms. This disturbance, however, is only very small, but any tendency 
of the investigators to alter the order in which the names were entered so as 
to secure a suitable person for supplementary questioning could easily give 
rise to more serious biases. 
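
The arithmetic behind the marked lines can be checked with a short sketch. It assumes that the five styles were issued in the proportions 16 : 1 : 1 : 1 : 1 given in Table 4.11; the third marked line of style Z is unclear in the table as reproduced here, and the value 46 used below is a placeholder assumed purely for illustration. The case of forms filled only as far as line 70 is likewise a hypothetical example.

```python
from fractions import Fraction

# style: (relative frequency, marked lines). The Z entry 46 is an assumed placeholder.
STYLES = {
    "V": (16, (14, 29, 55, 68)),
    "W": (1, (1, 5, 41, 75)),
    "X": (1, (2, 6, 42, 77)),
    "Y": (1, (3, 39, 44, 79)),
    "Z": (1, (4, 40, 46, 80)),
}
TOTAL_WEIGHT = sum(weight for weight, _ in STYLES.values())   # 20


def expected_marked(filled_lines, styles=STYLES):
    """Expected number of marked lines, per form, that fall among the filled lines."""
    filled = set(filled_lines)
    return sum(
        Fraction(STYLES[s][0], TOTAL_WEIGHT) * len(filled.intersection(STYLES[s][1]))
        for s in styles
    )


all_lines = range(1, 81)          # a completely filled form
short_form = range(1, 71)         # hypothetical: lines 71-80 left blank

overall_fraction = expected_marked(all_lines) / 80
wz_share_full = expected_marked(all_lines, "WXYZ") / expected_marked(all_lines)
wz_share_short = expected_marked(short_form, "WXYZ") / expected_marked(short_form)

print(overall_fraction, wz_share_full, wz_share_short)
```

With every line filled the sampling fraction is 1 in 20, and one-fifth of the sample comes from the lines marked on the W-Z forms. If the blanks reach back over lines 75-80 but not as far as line 68, that share falls below one-fifth, which is the slight deficiency described above.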

The danger of this type of bias is always present in this method of sampling, 
and can only be overcome by the most rigorous training of observers, and 
the imposition of rules which determine uniquely the order in which names 
are entered on the list. 



4.12 Frames from lists of households or dwellings 

Lists of households or dwellings are frequently available from such 
sources as rating offices, electoral registers, etc. Frames based on such lists 
are in many ways preferable to frames based on lists of individuals. As already 
mentioned, in most surveys in which the information is collected by personal 
visit it is advantageous, and often essential, to collect information from all 
members of a household, in other words to use households as sampling units. 

Frames consisting of lists of dwellings also have a much greater degree 
of permanence, being unaffected by movements of the population. Such 
frames, if complete at the time of their construction, will only become incomplete 
to the extent that there is new building, or changes in the use of existing 
buildings. New building is necessarily a slow process, and the listing of new 
buildings usually presents no serious difficulties. 

Lists of households can generally be utilized to give a frame of dwellings 
by taking as the sampling units the dwellings occupied by the households at 
the time the frame was constructed. Certain special precautions are required 
to ensure the inclusion of dwellings which were unoccupied at the time the 
list was prepared. In a town in which the dwellings are arranged in streets 
and in which the list is also arranged by streets this presents no particular 
difficulty. What has been called the half-open interval can be used. The 
procedure is as follows. When drawing the sample, the dwelling unit appearing 
next in the list to the selected unit is recorded, and the field investigator is 
instructed to see if there is any other unit on the ground between these two units, 
and if so to include that unit in the sample. Thus the field investigator might 
receive the instruction to survey No. 9 in a certain street, with No. 13 as the next 
recorded unit (odds only). If on visiting No. 9 he finds that No. 9A and No. 11 
also exist, these are also surveyed. The even numbers between 9 and 13 are 
not included, since the instruction "odds only" indicates that they lie on the 
opposite side of the street. This procedure is clearly only possible if the list 
is arranged in an order which corresponds to some geographical pattern on 
the ground. If the list is not so arranged, incompleteness of the frame cannot 
be corrected by the use of the half-open interval or analogous procedure. In 
such cases the frame will have to be amended by other means, and complete 
rearrangement of the list in some geographical order may be necessary. 
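
The half-open interval rule may be expressed as a small function. The sketch below is illustrative only: the function name and the representation of the street as an ordered list of house numbers (one side of the street, in ground order) are assumptions made for the example.

```python
def half_open_interval(ground_order, selected, next_listed):
    """Units found on the ground from `selected` up to, but not including,
    `next_listed`. `ground_order` lists the units as met on the ground along
    the listed side of the street. If there is no next listed unit (None),
    the interval runs to the end of the street."""
    start = ground_order.index(selected)
    if next_listed in ground_order:
        end = ground_order.index(next_listed)
    else:
        end = len(ground_order)
    return ground_order[start:end]


# Odd side of a street: Nos. 9A and 11 exist on the ground but are not in the list.
ground = ["7", "9", "9A", "11", "13", "15"]
listed = ["7", "9", "13", "15"]

to_survey = half_open_interval(ground, "9", "13")
print(to_survey)

# Every unit on the ground falls in the interval of exactly one listed unit.
covered = []
for unit, following in zip(listed, listed[1:] + [None]):
    covered += half_open_interval(ground, unit, following)
```

Because the intervals of successive listed units partition the street, each unlisted dwelling acquires the same chance of selection as the listed unit that precedes it.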

An example of the use of this type of frame is provided by some surveys 
carried out during the war in certain towns in the United Kingdom by the 
Ministry of Home Security, to investigate disturbances to the population on 
account of air raids. In the English towns electoral registers were used 
as frames, and in the Scottish towns rating lists were so used. The electoral 
registers consist of printed lists of voters arranged by streets in order of 
dwellings, all voters in one dwelling appearing together. Each dwelling therefore 
has as many entries as there are voters. Consequently, selection of entries in 
the list with equal probability will not give an equal probability of selection 
in the different dwellings. This could have been overcome by subdividing 
the list into dwellings, and basing the sampling on these dwellings. 

As the surveys had to be conducted at considerable speed, delay in selection 
of the sample was avoided by the device of examining every xth entry, and 
including the dwelling in the sample if the entry referred to the first listed member 
in the dwelling. This introduces a certain additional discrepancy between 
the working sampling fraction, 1/x, and the actual fraction of dwellings included 
in the sample, which introduces errors that are appreciable relative to the 
sampling errors for estimates of such quantities as numbers in the population 
obtained by multiplying the sample total by x. Such estimates were not the 
primary concern of these surveys, and consequently no adjustments were 
required. If necessary, errors arising from this cause could have been eliminated 
subsequently by ascertaining the ratio of the number of dwellings included 
in the sample to the number in the whole register, and treating this ratio as 
the true sampling fraction. 
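
The device of taking a dwelling only when the examined entry is its first listed voter can be illustrated as follows. The register below is hypothetical (one label per voter, voters of the same dwelling adjacent), and the function name is an assumption of the sketch.

```python
from fractions import Fraction


def sample_dwellings(register, x, start):
    """Examine every x-th entry beginning at position `start` (0-based) and keep
    the dwelling only when the entry examined is its first listed voter."""
    chosen = []
    for i in range(start, len(register), x):
        dwelling = register[i]
        is_first_listed = (i == 0) or (register[i - 1] != dwelling)
        if is_first_listed:
            chosen.append(dwelling)
    return chosen


# Hypothetical register: dwellings a-g with 1, 3, 2, 4, 1, 2 and 3 voters.
register = ["a"] * 1 + ["b"] * 3 + ["c"] * 2 + ["d"] * 4 + ["e"] * 1 + ["f"] * 2 + ["g"] * 3
x = 4

# Over the x possible starting points each dwelling is taken exactly once,
# so the chance of selection is 1/x whatever the number of voters.
hits = {dwelling: 0 for dwelling in dict.fromkeys(register)}
for start in range(x):
    for dwelling in sample_dwellings(register, x, start):
        hits[dwelling] += 1

one_sample = sample_dwellings(register, x, start=0)
realised_fraction = Fraction(len(one_sample), len(hits))
print(hits, one_sample, realised_fraction)
```

For any single start the realised fraction of dwellings (here 2 in 7) differs from the working fraction 1/x, which is the discrepancy referred to above; the ratio of dwellings sampled to dwellings in the register supplies the corrected fraction.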

In this survey the method of the half-open interval was used to deal with 
dwellings not included in the list, and was found to be quite satisfactory. 
Had there been new housing estates not covered by the register these would 
have had to be dealt with separately. 

In certain towns the separate flats of blocks of flats were not listed, and 
therefore presented a difficulty, since the blocks constituted very large 
units whose chance inclusion or exclusion would have materially increased 
the sampling error. The existence of blocks of flats was, however, always 
apparent from the large number of voters appearing under the same address. 
The blocks were therefore listed, together with other large institutions, by 
preliminary inspection of the register, and every xth flat was selected by visit 
to all the blocks in turn. 

4.13 Frames provided by town plans 

Town maps and plans provide a useful frame for the sampling of dwellings 
in built-up areas. In some cases there may be detailed maps showing the 
location of all dwellings, but in many cases only street plans, not showing 
any great amount of detail, will be available. 

Any town plan which gives an accurate representation of the streets will 
enable the town to be divided up into "blocks," i.e. areas bounded by streets. 
Such a plan will, therefore, provide a frame for area sampling in which the 
units are blocks. A sample of dwellings can then be obtained by including 
all the dwellings in the selected blocks. In general, however, the variability 
between block and block is likely to be large even after careful stratification, 
since there is often considerable local segregation of different classes of the 
population. Consequently, two-stage sampling is in general advantageous, 
blocks being taken as the first-stage units and dwellings as the second-stage 
units. 

To obtain a two-stage sample in cases in which the map does not show the 
location of dwellings, it will be necessary to construct the second-stage frame 
for the blocks selected at the first stage by ground survey. This, however, 
is a much lighter task than the construction by ground survey of a frame for 
all dwellings in the city, and can frequently be done in the course of the survey 
itself. 

In towns in which the natural blocks are of very unequal area, groupings 
of the smaller blocks or subdivision of the larger blocks should be performed, 
so as to reduce the within-strata inequalities in area. If little is known about 
the town, it may be advantageous to make a ground survey in order to demarcate 
and stratify the block units. It may even be advantageous to carry out a rough 
preliminary count, or make some other rough estimate of the number of dwellings 
in each block, as this will enable the subsequent selection of the first-stage 
units to be made from within strata with probabilities proportional to the 
estimated numbers of dwellings. This was done in parts of the Greek population 
census described in Section 4.16. If such preliminary estimates are not 
available the best that can be done is to make a selection with probabilities 
proportional to area, but block areas are unlikely to be very closely correlated 
with the number of dwellings, even within strata. In either case the second- 
stage sampling fractions may be taken inversely proportional to the first-stage 
fractions, so as to give a constant overall sampling fraction. 
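
A minimal sketch of how the second-stage fractions follow from the first-stage probabilities is given below. The block sizes, the number of blocks selected and the overall fraction of 1 in 20 are hypothetical figures chosen for the example, and the function name is an assumption.

```python
from fractions import Fraction


def two_stage_fractions(estimated_sizes, n_first_stage, overall):
    """For each block return (first-stage probability, second-stage fraction),
    the first proportional to estimated size, the second inversely so."""
    total = sum(estimated_sizes)
    plan = []
    for size in estimated_sizes:
        p_first = Fraction(n_first_stage * size, total)
        f_second = overall / p_first
        plan.append((p_first, f_second))
    return plan


sizes = [40, 60, 100, 120, 80]        # hypothetical estimated dwellings per block
plan = two_stage_fractions(sizes, n_first_stage=2, overall=Fraction(1, 20))

overall_by_block = [p * f for p, f in plan]
expected_take = [f * size for (_, f), size in zip(plan, sizes)]
print(overall_by_block, expected_take)
```

The product of the two fractions is the same for every block, and the expected number of dwellings taken from a selected block is also constant so long as the estimates are correct. If a block proves larger or smaller than estimated, the number taken changes but the overall probability of selection does not.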

Whether an elaborate procedure of this kind is needed depends not only 
on the accuracy required but also on whether estimates of total numbers are 
required from the survey. Since total numbers will be highly correlated with 
numbers of dwellings, prior supplementary information on these numbers 
for the different blocks, even if only rough, will be particularly effective in 
reducing the sampling variability of estimates of total numbers. It is 
not likely to have such large effects on estimates of the proportions of the 
population falling in various categories. 

Sampling by streets is sometimes used instead of sampling by blocks. 
This is usually not so satisfactory as sampling by blocks, since each block repre- 
sents a clearly defined area, whereas if a street is taken, there is often doubt 
as to exactly what is to be included and what excluded : alleyways and court- 
yards having entrances from more than one street, and not shown on the street 
map, for example, present considerable difficulty if the sampling is by streets. 

Sampling by blocks is particularly valuable for surveys of towns in which 
all types of building have to be covered. Second-stage sampling of any or all 
of the different types of building can be adopted if required, by enumerating 
the different types for the selected blocks after the first-stage sample has 
been drawn. 

4.14 Frames provided by maps of rural areas 

The use of maps as a frame for the sampling of rural areas presents some- 
what different problems from those encountered in the sampling of towns 
by the aid of town plans. 

If accurate and detailed maps showing all or virtually all buildings are 
available, rectangular areas may be used as sampling units, the buildings falling 
in the selected areas being examined on the ground to see whether they are 
dwellings, with a further examination for unmapped dwellings. 

Sampling with probability proportional to the apparent number of dwellings 
indicated by the map is possible, but would involve counting the dwellings 
in all the rectangular areas. Consequently it is better, if preliminary work 
on the maps of this magnitude appears to be worth while, to divide the map 
into areas containing approximately equal numbers of dwellings, using natural 
boundaries as far as possible and paying particular attention to stratification. 

It may be noted here that the selection of a point at random on the map 
and selection of the dwelling unit nearest to this point for inclusion in the 
sample (a method which is sometimes used) is inadmissible, since a unit 
which is widely separated from other units will have a much greater chance 
of selection than one which is close to other units. 
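
The inequality of the chances under nearest-point selection is easily demonstrated in one dimension. The sketch below replaces the map by a straight road and is an illustration only; the positions, the road length and the function name are hypothetical.

```python
from fractions import Fraction


def nearest_point_probabilities(positions, length):
    """Chance that each dwelling is the nearest one to a point chosen
    uniformly at random on a road running from 0 to `length`."""
    pts = sorted(positions)
    # A dwelling is nearest for every point lying between the midpoints
    # that separate it from its neighbours on either side.
    cuts = [Fraction(0)] + [Fraction(a + b, 2) for a, b in zip(pts, pts[1:])] + [Fraction(length)]
    return {p: (cuts[i + 1] - cuts[i]) / length for i, p in enumerate(pts)}


# Hypothetical positions in yards: three dwellings close together, one isolated.
probabilities = nearest_point_probabilities([100, 120, 140, 800], length=1000)
print(probabilities)
```

The isolated dwelling at 800 is selected with probability 53/100, while the dwelling at 120, hemmed in by its neighbours, has a probability of only 1/50. With equal chances each would have probability 1/4.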

With less detailed maps rectangular areas marked on the map will not be 
capable of being demarcated exactly on the ground. Natural features occurring 
on the maps must therefore be used as boundaries of the sampling units. This 
will necessarily give units of differing size. In particular, occasions will arise 
when it is impossible to subdivide a somewhat large area. In such cases, the 
area in question may be taken to represent two or more units. If any of these 
units are selected, a subdivision into two or more parts as alike as possible 
is made on the ground at the time of the survey, and the requisite number 
of parts selected by random choice. 

In most countries, even in rural areas, there will be a number of villages 
of varying sizes, which are best dealt with separately by some form of 
stratification and two-stage sampling, since if these are included in the area 
sample a high degree of variability will be introduced. The use of a variable 
sampling fraction at the first stage, a larger proportion of the larger villages 
being selected, will be advantageous. A compensating reduction in the second- 
stage sampling fraction can be made if desired. The boundaries of all villages 
will require careful demarcation, as otherwise there will be ambiguity as to 
what should be included in the area sample. 

4.15 Frames from lists of villages 

In undeveloped areas the available maps are not likely to be of sufficient 
accuracy for area sampling. Where the population is concentrated in villages, 
these usually form the best first-stage sampling units. A list of villages will 
then serve as a suitable frame. 

Even if the majority of the population is concentrated in villages there may 
be a residue located in the intervening countryside. If this residue owes 
allegiance to definite villages, the problem is relatively simple, since all that 
is required is the identification of the individuals belonging to the selected 
villages. This can normally be done by the head-men of these villages. 

If no such association exists, some form of area sample of the intervening 
areas may be necessary. If rough maps are available, suitable areas may be 
demarcated by tracks, rivers, etc. If no maps are available, some form of line 
sampling may be possible in open areas. 

If the country is not sufficiently open to be easily traversed, the construction 
of a rough map of the tracks may be necessary before any adequate sampling 
of the intervening areas can be carried out. It may be possible, however, to 
use these tracks without full mapping. Thus all tracks leading from villages 
selected for the sample may be traversed, and dwellings to which they give 
access included up to half-way to the next listed, but not necessarily selected, 
village. Such a method will only be effective with a relatively simple track 
system such as is met with in forest areas : intermediate junction points, for 
example, present special problems. 

4.16 The 1946 population sample for Greece 

The 1946 population sample for Greece provides a good example of the 
way in which sampling methods of the types discussed in the previous sections 
can be used to obtain speedy census data from a sample covering the whole 
population of a country (Jessen et al., 1947, C).* 

The sample was taken by the Allied Mission for Observing the Greek 
Election, as part of the investigation of the accuracy of the electoral lists. A 
population sample was required in order to test for omissions from the electoral 
lists, and the opportunity was therefore taken of securing more general census 
data. The corresponding test for duplications and redundancies in the lists 
was made by examining the lists themselves and investigating a sample of 
names drawn from them. 

The frame for the first stage of the population sample was that given by 
the 1940 Population Census. This census gave returns for koinotetes, which 
are small communities or groups of villages, and demoi, which are towns and 
cities, usually with more than 10,000 population. Maps were available which 
showed the areas included in these koinotetes and demoi, and the names and 
location of all the populated centres. The koinotetes and demoi were used as 
sampling units at the first stage of the sampling. The units were stratified 
according to their population in 1940, and a variable sampling fraction was 
used. The actual scheme is shown in Table 4.16. Selection from within 
strata was systematic. 
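
The consistency of the scheme can be verified by a short computation. The sketch below assumes that the two fractional columns of Table 4.16 are the first-stage and second-stage sampling fractions and that the last two columns are the numbers of units in the stratum and in the sample; the stratum numbering 1 to 6 is also an assumption.

```python
from fractions import Fraction as F

# (stratum, first-stage fraction, second-stage fraction, units in stratum, units in sample)
DESIGN = [
    (1, F(1, 100), F(1, 5), 2147, 20),     # koinotetes, 0-499
    (2, F(1, 50), F(1, 10), 2049, 40),     # koinotetes, 500-999
    (3, F(1, 20), F(1, 25), 1366, 70),     # koinotetes, 1,000-4,999
    (4, F(1, 5), F(1, 100), 54, 10),       # koinotetes, 5,000 and over
    (5, F(1, 2), F(1, 250), 52, 26),       # demoi under 25,000
    (6, F(1, 1), F(1, 500), 22, 22),       # demoi of 25,000 and over
]

overall = {s: f1 * f2 for s, f1, f2, _, _ in DESIGN}
expected_units = {s: float(f1 * n) for s, f1, _, n, _ in DESIGN}
units_in_sample = sum(row[4] for row in DESIGN)

print(overall)
print(expected_units, units_in_sample)
```

On this reading the product of the two fractions is 1/500 in every stratum, and the numbers of first-stage units actually taken agree closely with the numbers expected from the first-stage fractions.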

The sampling of the selected first-stage units was based on lists of house- 
holds within the area. These lists were either based on existing lists checked 
and brought up to date, or were specially prepared to show the location of the 
households on a map. 

For the sampling of towns an additional stage was used, a sample of 
blocks demarcated on an existing or a constructed street plan being first taken, 
with a further sample of houses from within the selected blocks. Sampling 
was sometimes with probability proportional to estimated numbers of house- 
holds, these estimates being obtained by a rapid cruise of the whole area, and 
sometimes with equal probability. 

The sampling fraction at the final stage was in all cases adjusted so as to 
give a constant overall sampling fraction. When blocks were sampled with 
probability proportional to estimated numbers of households, this required 
that the sampling fraction within the selected blocks should be taken as inversely 
proportional to the estimated number of households in the block. Thus the 
parish of Agios Panteleemous, which is given as an example, was initially sub- 
divided into 98 blocks. Before sampling, some of the smaller blocks were 
combined so as to give 65 combined blocks.† The total of the estimated 
number of households was 966. It was decided to sample three blocks, which 
were selected systematically by taking a sampling interval of 322 (= 966/3) 
with a random starting point of 288, using sub-totals in the manner of 
Example 3.2.c. This gave combined blocks containing 23, 18, and 13 
estimated households respectively. Since a sampling fraction of 1/100 within 
the parish was required, the estimated number of households in the sample 
was 966/100 = 10. This number was divided approximately equally between 
the three selected blocks (4, 3, 3), and the sampling intervals were calculated 
by dividing the estimated numbers in the blocks by these numbers, i.e. the 
intervals were taken as 23/4 = 6, etc. This procedure gives the required 
constancy in the overall probability of selection. The actual number of houses 
in the sample will of course differ from 10 if the estimated numbers are in 
error: it is the overall probabilities of selection, not the numbers of houses, 
that must be fixed. 

* See also U.S. Dept. of State publ. 2522 (1946, D') and Jessen et al. (1949, D'). 
† This procedure is the same as that used in the sampling of Hertfordshire farms 
by parishes, Section 3.11. 

TABLE 4.16 GREEK POPULATION CENSUS: SUMMARY OF THE SAMPLE DESIGN 

          Population        Approx. mean   Sampling fraction       Number of units 
Stratum   of unit (1940)    population     1st stage   2nd stage   Total    In sample 

For koinotetes 
1         0-499                   350      1/100       1/5         2,147        20 
2         500-999                 750      1/50        1/10        2,049        40 
3         1,000-4,999           2,500      1/20        1/25        1,366        70 
4         5,000 and over        7,000      1/5         1/100          54        10 
          TOTALS                                                   5,610       140 

For demoi 
5         Under 25,000         17,000      1/2         1/250          52        26 
6         25,000 and over                  1/1         1/500          22        22 
          TOTALS                                                      74        48 
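
The arithmetic of the Agios Panteleemous example can be retraced as follows. Only the first within-block interval (23/4 = 6) is stated in the text; the other two are obtained here by the same rule of rounding to the nearest whole number, which is an assumption of this sketch, as are the variable names.

```python
from fractions import Fraction

total_estimated = 966                  # estimated households in the parish
blocks_to_select = 3
interval = total_estimated // blocks_to_select            # sampling interval, 322
start = 288                                                # random starting point
# Points to be located among the cumulative sub-totals of estimated households.
selection_points = [start + k * interval for k in range(blocks_to_select)]

selected_sizes = (23, 18, 13)          # estimated households in the selected blocks
allocation = (4, 3, 3)                 # the 10 expected households shared between blocks
within_block_intervals = [
    round(Fraction(size, take)) for size, take in zip(selected_sizes, allocation)
]

print(interval, selection_points, within_block_intervals)
```

The selection points are 288, 610 and 932, each falling within the cumulative total of 966, and the within-block intervals come out as 6, 6 and 4.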

The ratio method of estimation was adopted, using the 1940 population 
data as supplementary information. The actual method used was that appropriate 
to sampling without stratification with probability proportional to size of 
unit (Section 6.16), as this was considered to be the most accurate. There 
does, however, appear to be some danger of the introduction of bias by this 
method, and the unbiased method appropriate to a stratified sample with 
variable sampling fraction (Section 6.11) might have been preferable. 

The survey was very successful and achieved high accuracy, the standard 
error of the estimate of the total population being estimated to be 2.1 per 
cent. The field work occupied 65 observer teams, each consisting of an observer, 
an interpreter and a driver, with a jeep, for three weeks. The entire sample 
and the computations were completed in 7 weeks. 

4.17 Master samples 

When a number of surveys covering the same population or aggregate 
of material are likely to be required, it is sometimes advantageous to construct 
a master sample, from which smaller samples can be drawn as required by means 
of a sub-sampling scheme. 

The use of a master sample has a number of advantages. It enables a more 
accurate, complete and adequate frame to be constructed than could be justified 
if the frame were only required for a single survey. It simplifies the selection 
of samples, since in the sub-sampling only the material contained in the master 
sample has to be subjected to the selection process. It enables supplementary 
information to be obtained which is of value in improving the accuracy of the 
various surveys. And it enables surveys on the same material to be so planned 
that the same units are not selected an excessive number of times for different 
surveys, a matter of some importance when the information is obtained by 
response to questionnaires. 

The most extensive and elaborate master sample so far constructed is the 
master sample for agriculture of the United States of America. The construction 
of this sample was undertaken by the Statistical Laboratory of Iowa State 
College, in co-operation with the Bureau of Agricultural Economics and the 
Bureau of the Census (King and Jessen, 1945, G). 

The sampling units of this master sample consist of small areas covering 
the whole of the United States. The units have a mean area of about 2.5 
square miles, but vary according to location and other circumstances, the 
mean area per state ranging from 0.71 square miles to 108 square miles. They 
were formed so as to contain on the average 4, 5 or 6 farms, depending on the 
part of the country. One-eighteenth of all the areas were selected for the 
master sample. 

The whole of the land area of the United States was divided into three 
categories, called in the master sample "primary strata." These primary 
strata are (1) the incorporated stratum, (2) the unincorporated stratum, (3) the 
open-country stratum. The incorporated stratum consists of incorporated 
cities and towns and unincorporated places regarded as "urban" by the 
Bureau of the Census. The unincorporated stratum consists of all named 
places outside the incorporated areas which have an estimated population 
of 100 or more, and all other areas which appear on the map and have a 
population density of 100 or more persons per square mile. 

The incorporated areas were defined by the corporate boundaries, of which 
the location could be obtained. The unincorporated areas were demarcated 
on the maps so as to give areas as compact as possible, while including every- 
thing that did not appear to be open country. Subject to this, the boundaries 
were chosen so as to be easily identified on the ground. Aerial photographs 
were used in some cases for this work. 

The general highway and transportation maps showed with varying degrees 
of accuracy the location of farms and other dwellings in the open-country 
areas and to some extent in the smaller unincorporated places, and these were 
therefore used to demarcate the actual sampling units of the open-country 
stratum. The procedure was as follows. The numbers of farms and non- 
farm units were first counted in what are termed "count units" from the 
map. A count unit consists of a unit defined by minor civil boundaries or 
natural boundaries, and in general included from 6 to 30 farms. These count 
units were numbered, and the number of farms and the total number of dwellings 
including farms were marked on the map. The number of sampling units into 
which each count unit was to be subdivided was also decided and noted on 
the map ; in making this decision, consideration was given to the prevalence 
of natural boundaries, etc. The data for each count unit were then recorded 
on punched cards, and cumulative totals of the farm count, the dwelling count, 
and the number of sampling units, were tabulated. These cumulative totals 
were used to determine the count units which contained selected units, a random 
number between 1 and 18 being chosen as a starting-point, and the count unit 
containing every 18th sampling unit being selected thereafter. The count 
units containing selected sampling units were next subdivided on the map 
into the specified number of sampling units, the subdivisions being so chosen 
that they could be located on the ground. The units so demarcated were 
then numbered or counted systematically and the appropriate sampling units 
selected. Existing aerial photographs were used extensively for the demarcation 
of boundaries. In cases in which there were no suitable natural boundaries 
on the maps or photographs two or more units were amalgamated, subdivision 
and random selection being subsequently made on the ground if either unit 
was selected. 
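
The use of cumulative totals to locate every 18th sampling unit may be sketched as below. The numbers of sampling units per count unit are hypothetical, and the function name and the 0-based numbering of count units are assumptions of the example.

```python
from bisect import bisect_left
from itertools import accumulate


def select_from_count_units(units_per_count_unit, start, step=18):
    """Return (count unit index, sampling unit number within it) for the
    sampling unit numbered `start` and every `step`-th one thereafter."""
    cumulative = list(accumulate(units_per_count_unit))
    picks = []
    for target in range(start, cumulative[-1] + 1, step):
        # First count unit whose cumulative total reaches the target number.
        idx = bisect_left(cumulative, target)
        before = cumulative[idx - 1] if idx else 0
        picks.append((idx, target - before))
    return picks


sizes = [5, 3, 6, 4, 7, 5, 6, 4]       # hypothetical sampling units per count unit
picks = select_from_count_units(sizes, start=7)
print(picks)

# Over the 18 possible starting points every sampling unit is taken exactly once.
all_picks = [p for s in range(1, 19) for p in select_from_count_units(sizes, start=s)]
```

With a starting number of 7 the sampling units numbered 7 and 25 are taken, i.e. the second unit of count unit 1 and the seventh unit of count unit 4. Only these count units then need to be subdivided on the map.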

Somewhat different procedures, which need not be detailed here, were 
followed for the unincorporated and incorporated strata. For the incorporated 
stratum information was obtained from the Bureau of the Census on numbers 
of farms, etc. 

In its final form the master sample will provide an adequate master sample 
of both farms and population, and also of the land area of the whole of the 
United States. Because the sampling units consist of areas, the frame will 
remain complete and adequate whatever changes occur in the course of time. 
The supplementary information provided by the number of farms and number 
of dwelling units will naturally become progressively more inaccurate, but 
major changes are likely to take place only in limited areas, and the master 
sample will in the course of its use reveal the extent of these inaccuracies. 
There will, therefore, be no difficulty in revising the sample when it appears 
necessary for those areas of the country where extensive changes have occurred, 
and this will in no way invalidate the existing sample for the rest of the 
country. 

It will be seen that the construction of a master sample of this type is a major 
undertaking, and it should not be assumed that a master sample of the same 
type is necessarily expedient in other countries in which the conditions are 
different. Thus in the United Kingdom the 6-inch Ordnance Survey maps 
provide an excellent frame for land area surveys, and the register of farms 
which is maintained for the collection of agricultural statistics provides a very 
complete frame of farms. If a master sample for agriculture is ever considered 
necessary, its construction could be based on this register and on the associated 
returns of farmers. The task of construction would therefore be very much 
simpler than would be the case if no such register existed. On the other hand, 
there is a need in the United Kingdom for an adequate master sample for 
localized population surveys. This problem is discussed in the next section. 

4.18 Localized population surveys 

As has already been indicated in Section 4.9, surveys are often required 
which will give reasonably accurate estimates for the country as a whole, but 
not for the separate small administrative districts. Such surveys have to be 
concentrated in a few localities, particularly if they are to be carried out by 
field investigators, since the amount of travelling would otherwise be excessive 
and supervision difficult. They may therefore be termed localized surveys. A 
multi-stage process must be used, the units at the first stage being administrative 
districts or similar areas of such size that each selected unit is capable of being 
covered by a single investigator or a small team of investigators. 

The crux of the problem, therefore, consists of so planning the primary 
stage of the sampling process that the sampling error at this stage is not excessive. 
A secondary consideration, which must not be ignored, is that the within- 
strata comparisons should be sufficiently numerous to furnish a reasonable 
estimate of the sampling error at the first stage. 

The use of stratification is obviously indicated. This stratification must 
in the first instance serve to differentiate between urban and rural areas. 
Consequently the country should be divided into large cities, into smaller 
urban areas, and into rural areas, in a manner somewhat similar to that employed 
in the master sample of the United States. The number of classes required 
will depend on the character of the towns and rural areas. A variable sampling 
fraction will be required in association with this stratification; for most surveys 
it will probably be necessary to take all of the very large towns, but a proportion 
only of the intermediate towns, a smaller proportion of the smaller towns, 
and a still smaller proportion of the rural areas. Regional stratification of the 
smaller towns and rural areas may also be adopted as far as possible in parallel 
with this stratification. 
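
The variable sampling fraction described above can be sketched in a few lines of Python. This is only an illustration: the class names, the numbers of units, and the fractions are hypothetical, and the helper `select_first_stage` is not taken from any actual survey.

```python
import random

def select_first_stage(units_by_class, fractions, seed=0):
    """Draw first-stage units with a different sampling fraction per class.

    units_by_class: dict mapping class name -> list of unit identifiers.
    fractions:      dict mapping class name -> sampling fraction in (0, 1].
    Returns a dict mapping class name -> list of selected units.
    """
    rng = random.Random(seed)
    sample = {}
    for cls, units in units_by_class.items():
        f = fractions[cls]
        # A fraction of 1 means the whole class is taken (e.g. very large towns).
        n = len(units) if f >= 1 else max(1, round(f * len(units)))
        sample[cls] = rng.sample(units, n)
    return sample

# Hypothetical frame: all counts and fractions are assumptions for illustration.
frame = {
    "large towns": [f"L{i}" for i in range(10)],
    "intermediate towns": [f"I{i}" for i in range(60)],
    "small towns": [f"S{i}" for i in range(200)],
    "rural areas": [f"R{i}" for i in range(1000)],
}
fractions = {"large towns": 1.0, "intermediate towns": 0.25,
             "small towns": 0.10, "rural areas": 0.02}
sample = select_first_stage(frame, fractions)
```

Regional stratification could be layered on top by calling the same function once per region.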

These two types of stratification by themselves, however, are not likely 
to be entirely adequate for the urban areas, and some further form of stratifica- 
tion may be sought which will ensure (a) reasonably correct proportions of 
areas of different industrial types, and (b) reasonably correct proportions of the 
different social classes. 

The methods by which it may be possible to ensure this will vary greatly 
according to the nature of the country, the type of primary unit that is adopted, 
and the amount of information that is available on these primary units. Ad- 
ministrative areas are usually most suitable from the point of view of the amount 
of readily available information, but they are not always ideal from the sampling 
point of view. As far as the United Kingdom is concerned, administrative 
areas appear to be the only possible type of area which can be used without 
a great deal of preliminary work. They will probably prove reasonably satis- 
factory if the boroughs and urban districts associated with the large towns 
are treated as parts of these towns, and sampled fairly intensively. Thus, 
for example, the sampling of the various parts of London and of its satellite 
suburban towns should be considered as a special problem separate from that 
of the sampling of the smaller towns in other parts of the country. 

The second-stage sampling of the selected first-stage units is not likely 
to present any very serious problems. In the very large towns such as London, 
and in dispersed rural areas, two or more stages are likely to be required to 
avoid excessive travelling. Adjustment of the sampling fraction at the final 
stage to give equal overall sampling fractions is often advisable, since estimates 
can then be rapidly and simply obtained. Provision at the final stage for a 
proper rota of households to be included in the different samples, so as to avoid 
using the same household too frequently, is also of importance. 
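
The adjustment of the final-stage fraction can be made concrete with a small calculation. The sketch below assumes that the overall sampling fraction is the product of the fractions used at the successive stages; the function name and the numerical values are hypothetical.

```python
def final_stage_fraction(overall, earlier_stage_fractions):
    """Final-stage fraction that makes the product of all stage fractions equal `overall`."""
    product = 1.0
    for f in earlier_stage_fractions:
        product *= f
    final = overall / product
    if final > 1:
        raise ValueError("earlier stages sample too thinly to reach this overall fraction")
    return final

# Hypothetical target: an overall fraction of 1 in 1000 everywhere.
urban = final_stage_fraction(1 / 1000, [1.0])     # town taken with certainty
rural = final_stage_fraction(1 / 1000, [1 / 50])  # district taken with probability 1/50
```

With equal overall fractions every selected household represents the same number of households in the population, which is why unweighted totals can be scaled up directly.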

Much further research work remains to be done before it can be said with 
certainty whether a sample of this nature covering the United Kingdom is 
likely to be satisfactory for all purposes, or whether samples having a different 
structure will be required for different purposes. The importance of 
investigating the possibility of obtaining such a sample is clear. Without it, 
localized sociological and economic surveys of the general population cannot 
be carried out with any high, and at the same time ascertainable, degree of 
accuracy. 

4.19 The U.S. series of employment estimates 

An early example of a localized sample is that set up in the United States 
in 1939 to provide regular and speedy statistics on unemployment, employment 
and the labour force (Frankel and Stock, 1942, F). The sample was modified 
and improved in 1943 (Eckler, 1945a, F). 

In the original sample, counties were used as the first-stage sampling units. 
All the 3097 counties of the United States were classified and sampled as 
follows : 

              Total No.       Percentage of     No. of counties 
              of counties     population        in sample 

Cities               9             14                  9 
Urban              447             50                 28 
Rural             2641             36                 27 
                  ----            ---                 -- 
Total             3097            100                 64 

The 9 city counties relate to the 5 largest cities ; all these 9 counties were 
included in the sample. The urban counties are those with 1930 populations 
of 45,000 and over. In the urban and rural classes a triple stratification, each 
of three strata, was adopted, the bases of the three stratifications being 
population, administrative areas, and percentage unemployed. Divisions 
between each of the main strata were so chosen that approximately equal 
numbers of counties fell in each main stratum. There were thus 27 sub-strata 
for both the urban and rural classes.* One county was selected from each 
of these sub-strata at random, with one exception where two counties were 
selected. 
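
A rough sketch of a three-way stratification of this kind is given below. It assumes, purely for illustration, that each of the three bases can be expressed as a numerical score per county so that each can be cut into three classes of nearly equal size; the field names and helper functions are hypothetical and do not reproduce the procedure actually used.

```python
import random
from collections import defaultdict

def tercile_labels(values):
    """Label each value 0, 1 or 2 so that the three classes are of nearly equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 3 // len(values)
    return labels

def one_per_substratum(counties, keys, seed=0):
    """Cross three three-way classifications and draw one county from each non-empty cell."""
    rng = random.Random(seed)
    labels = [tercile_labels([c[k] for c in counties]) for k in keys]
    cells = defaultdict(list)
    for idx, county in enumerate(counties):
        cells[tuple(lab[idx] for lab in labels)].append(county)
    # Cells are of unequal size, as the footnote to this section remarks.
    return {cell: rng.choice(members) for cell, members in sorted(cells.items())}
```

Because the three classifications are made independently, the 27 cells of the crossed table generally contain unequal numbers of counties even though each main stratum holds about a third of them.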

A further two-stage process was used to sample the urban and rural areas 
within the selected counties. The numbers of households to be selected from 
the various urban and rural areas were allocated on the basis of the census 
population figures for these areas. This led to the gradual development of a 
differential bias between urban and rural areas, owing to a drift of population 
away from the rural areas. 

The results from within a single county were aggregated without any 
weighting. The aggregates were then weighted according to the population 
of the stratum from which they were obtained. 
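
One possible reading of this weighting is sketched below: the unweighted county aggregate gives a rate, and the rate is multiplied by the population of the stratum that the county represents. The dictionary keys and the figures in the usage example are hypothetical.

```python
def national_total(strata):
    """Estimate a national total from one selected county per stratum.

    strata: list of dicts with
      'population' - population of the whole stratum,
      'count'      - persons in the county sample having the characteristic,
      'persons'    - persons in the county sample.
    """
    total = 0.0
    for s in strata:
        rate = s["count"] / s["persons"]   # unweighted within-county aggregate
        total += s["population"] * rate    # weighted by the stratum population
    return total

# Hypothetical figures for two strata.
example = national_total([
    {"population": 1000, "count": 10, "persons": 100},
    {"population": 2000, "count": 5, "persons": 50},
])
```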

The within-county sample was changed every 4-6 months. This was a 
compromise between having a constant group of households, which would 
give most accurate estimates of monthly changes, and having new households 
on each occasion so as to avoid repeated visits to the same household. It 
introduced a certain discontinuity into the results, which has been avoided 
in the modified sample by using a proper system of partial replacement of the 
type described in Section 3.18. 

In the modified sample, which included 68 first-stage units, allocation of 
households on the basis of population figures was abandoned. Instead, small 
areas were used as units at the second stage. This eliminates bias resulting 
from population drift. 

* It will be noted that the numbers of the counties in the different sub-strata 
were not by any means equal. 


Several other features were also introduced in order to improve the accuracy 
of the results. The stratification was more detailed, selection of the primary 
units was with probability proportional to their populations, and the ratios 
between the numbers of households having certain contrasting characteristics, 
e.g. farm and non-farm, were adjusted in each selected first-stage unit to agree 
with the corresponding ratios in the stratum to which the unit belonged. This 
last procedure is not entirely free from danger of bias. In a unit with a relatively 
small proportion of farm households, for example, those that do occur may 
be expected to be somewhat abnormal owing to proximity to non-farm areas, 
and such abnormal households will consequently receive excessive weight 
in the final results. 
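
Selection with probability proportional to population is commonly carried out by the method of cumulative totals. The following is a minimal sketch of that method, assuming integer population figures; the function name is hypothetical and the code is not a description of the procedure used in the modified sample.

```python
import bisect
import itertools
import random

def pps_select_one(sizes, rng):
    """Index of one unit drawn with probability proportional to its (integer) size."""
    cumulative = list(itertools.accumulate(sizes))
    # Pick a random number between 1 and the grand total, then find the first
    # unit whose cumulative total reaches it.
    r = rng.randint(1, cumulative[-1])
    return bisect.bisect_left(cumulative, r)

rng = random.Random(0)
chosen = pps_select_one([12000, 3000, 45000, 8000], rng)  # hypothetical populations
```

A unit of size zero can never be chosen, and a unit twice as large as another is twice as likely to be chosen.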

In both these samples only a single unit was selected from each of the 
first-stage strata. Although this unquestionably increases accuracy by permitting 
the use of smaller strata, it has the consequence that no fully valid estimate 
of error is available. The best estimate is that obtained by combining the 
strata in pairs, and this is likely to be somewhat of an overestimate. 
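
The pairing of strata can be illustrated as follows. The sketch assumes that the strata have been listed so that neighbours are as alike as possible and of comparable size, that each entry is the estimated total for one stratum, and that finite population corrections are ignored; it is one common form of the calculation, not necessarily the one used for these samples.

```python
def collapsed_strata_variance(stratum_estimates):
    """Variance estimate for the sum of stratum totals when one unit is taken per stratum.

    Adjacent strata in the list are treated as a pair, and each pair contributes
    the squared difference of its two estimated totals.
    """
    if len(stratum_estimates) % 2:
        raise ValueError("an even number of strata is needed to form pairs")
    firsts = stratum_estimates[0::2]
    seconds = stratum_estimates[1::2]
    return sum((a - b) ** 2 for a, b in zip(firsts, seconds))

# Hypothetical estimated totals for four strata, paired as (10, 12) and (20, 17).
v = collapsed_strata_variance([10, 12, 20, 17])
```

Each squared difference contains the real difference between the two strata as well as the sampling error, which is why the result tends to be an overestimate.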



4.20 Frames suitable for special classes of a human population 

Surveys of special classes taken from the whole of a human population 
are often required. If a general frame covering the whole population is 
available, it can be used for a survey of a special class by selecting a sample 
from the whole population, and rejecting those members which do not fall 
in the required class. If the frame itself does not contain the necessary 
information, this will necessitate surveying all units of the sample in order 
to find out which individuals are to be retained and which rejected. If the 
required class is only a small fraction of the whole population there will be 
a large proportion of rejects, and a disproportionate amount of work is there- 
fore required in such cases. 
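
The amount of wasted work is easily quantified. The sketch below assumes that the required class forms a known share of the general frame and that units are screened at random; the function name and the figures are hypothetical.

```python
def expected_screened(n_required, class_share):
    """Expected number of units to be surveyed to find n_required members of a class."""
    if not 0 < class_share <= 1:
        raise ValueError("class_share must lie in (0, 1]")
    return n_required / class_share

# Hypothetical case: 500 members wanted from a class forming a quarter of the population.
screened = expected_screened(500, 0.25)
rejects = screened - 500
```

The proportion of rejects rises rapidly as the share of the required class falls, which is the reason for preferring a partial frame when one exists.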

Consequently, if a frame covering only the required class or classes is avail- 
able, this should be used in preference to a general frame. In surveys of the 
labour force, for example, it is often possible to utilize unemployment 
insurance registers and similar records. Such frames are often to a certain 
extent inadequate (all types of labour may not be included in an unemployment 
insurance scheme, for example), but their greater convenience frequently 
outweighs their defects. Occasionally it may be considered advisable to cover 
the excluded classes with lower accuracy by means of a general frame. 

When no partial frame exists a survey undertaken for another purpose 
can sometimes be used to provide one. Thus in a recent survey of the aged 
carried out by the Nuffield Trust in certain towns of the United Kingdom, 
the records of an earlier survey by the Social Survey of the Central Office 
of Information covering all households were used to locate those households 
which contained aged people. 

4.21 Frames suitable for the survey of economic institutions 

Surveys of economic institutions may be divided into two general classes : 
those covering the whole or a large part of the commercial and industrial 
undertakings of a given town or country, and those covering a single type 
of undertaking or industry. 

In surveys of the former class the use of frames constructed from maps 
or plans is often feasible : thus, in a general survey of factories of a given town 
a town plan may conveniently be used, with area sampling from this plan. 
Since, however, most commercial and industrial undertakings vary greatly 
in size, a variable sampling fraction is often required, a larger proportion of 
the large undertakings being selected. A map does not provide a suitable 
frame for this purpose. Even for general surveys, therefore, it is often advisable 
to use a special frame for the large undertakings, excluding these from the area 
sample, which is used to cover only the smaller undertakings. In a survey 
of the factories of a town, for example, there is usually no great difficulty in 
drawing up a list of the larger factories. If necessary a preliminary ground 
survey, with or without the aid of detailed maps, can be made. 

If a particular type of undertaking or industry requires to be surveyed, 
the use of any form of area sampling will generally be unsatisfactory unless 
the units are small and widely dispersed, as, for example, occurs with retail 
shops. Even here shopping centres will require differentiation from other 
areas. In other cases a list of all the units of the given type of undertaking 
or industry will form a very much more suitable frame. In order that a variable 
sampling fraction may be used, it is important that such a list should contain 
some indication of the size of the units. If no satisfactory frame of this type exists 
it will often be worth while carrying out a complete census, simply for the 
purpose of constructing a frame and collecting a few basic facts about the 
given type of undertaking or industry. Such a census is usually more effective 
if it is on a compulsory basis. 

When a frame provided by a complete census is required for repeated 
surveys, the problem of keeping it up-to-date must be considered. This is 
usually best effected by keeping a register of the undertakings concerned, 
and making a regulation that requires all changes to be reported. For the 
purposes of sampling, the most important type of change which requires to 
be recorded is that of new entries. Failure to report other changes will merely 
result in inaccuracies in the frame. 



4.22 Market research and opinion surveys 

Market research includes not only investigation into consumer reactions 
to goods and services and to advertising campaigns, but also investigations 
into consumer needs. In the case of consumer reactions information is mainly 
required on opinions, while in investigations of consumer needs factual informa- 
tion will also be required. Market research surveys can therefore be carried 
out in the same manner as sociological surveys of the questionnaire type, 
using the methods which have already been described. 

This is also true of other surveys of public opinion, but in many opinion 
surveys and also in certain types of investigation into consumer reactions, 
the requirements are somewhat different from those for sociological investiga- 
tions. Speed is often essential, and changes in the percentage of individuals 
holding a given opinion are frequently of more interest than the absolute value 
of the percentage holding that opinion at any one time. 

To meet these requirements, and to reduce costs to a minimum, what is 
known as the quota method of sampling has been developed. This method 
is a variant of purposive selection. Interviewers are given definite quotas of 
people in different social classes, of different age-groups, etc., and are instructed 
to obtain the requisite number of interviews in each quota. Additional 
instructions, which are designed to prevent excessively unrepresentative 
selection within the allotted quotas, may also be given on mode of contact, 
etc. The interviews themselves are sometimes carried out by house-to-house 
visits, sometimes by interviewing people in the streets and other public places, 
and occasionally even by telephone. 

It is clear that, however accurately the quotas are fulfilled, such samples 
cannot be regarded as the equivalent of random samples. Consequently the 
danger of bias is always present, and the quota method must therefore be ruled 
out as a suitable method of investigation for precise enquiries in which unbiased 
results are required. Moreover, if there is a change in conditions, a quota 
sample which has previously adequately reproduced the characteristics of 
the population may cease to do so. Consequently, the fact that a quota system 
has consistently given reliable results over a period of years is no guarantee 
that it will also do so in the future. 

The striking failure of public opinion polls to predict the results of the 
1948 American Presidential Election in contrast with previous successes 
has been explained as being due to a shift of opinion during the closing weeks 
of the campaign, but it is possible that this failure was in part attributable 
to defects in the sampling procedure. It is generally agreed that the votes 
of organized labour influenced the outcome of the election to a much greater 
extent than usual, and it may well be that, in spite of the quota system, the 
samples were very deficient in factory workers and other trade union labour. 
The mere fact that a quota system is designed to give the correct proportion 
of workers, or even of different classes of workers, does not necessarily ensure 
that those included are representative, as regards the way they vote, of the 
workers as a whole. Consequently the results may be biased in elections in 
which the different types of worker vote very differently. 

On the other hand, if used with skill the quota method may give sufficiently 
accurate results in simple enquiries where only general indications of the 
opinions held are required. If the samples are taken in the same manner on 
different occasions, and circumstances remain broadly the same, it may also 
provide a not-too-inaccurate measure of changes of opinion. 

Apart from the problem of obtaining a representative sample, there is the 
inherent difficulty in opinion surveys that an individual's opinion on a given 
subject is frequently both ill-defined and liable to change. Moreover, on certain 
subjects the respondents may be unwilling to voice their true opinions. 
Opinions are also held with very different degrees of intensity, which there 
is no easy way of measuring. Much of the information provided by public 
opinion polls is therefore of doubtful significance. 

4.23 Frames for agricultural censuses and surveys 

Agricultural censuses and surveys can be carried out in collaboration with 
the farmers, or in certain circumstances by direct observation without contacting 
the farmers. The latter method is in general only applicable to surveys of 
agricultural crops, and then only if all particulars required are ascertainable 
by inspection. For censuses and surveys of livestock the collaboration of the 
farmer is usually necessary, the essential difference being that livestock is 
mobile whereas crops are immobile. Collaboration is also obviously required 
if information relating to the farm as a whole is needed. In many countries 
contact with the farmer is advisable even for crop surveys, because exception 
may well be taken to the examination of a crop without the farmer's permission. 

If a census or survey is to be conducted by contacting the farmer, the farm 
will usually form the sampling unit at some stage of the sampling process. A 
frame covering farms will therefore be required. Such frames are provided 
either by lists of farms, or by some form of area sampling which serves to locate 
the farmhouses. Frames based on maps, etc., which are suitable for the sampling 
of human populations in rural areas (Section 4.14) are equally suitable for 
the sampling of farms. 

If contact with the farmer is not necessary maps can be used directly as a 
frame for crop surveys. Their use for this purpose is discussed in the next 
section. Even in this case, however, farms may well provide the best available 
frame. 

In crop surveys the natural unit for many purposes is the field and not 
the farm. In cases where it appears advisable to obtain information for some 
only of the fields of a farm under a given crop, a further stage will have to be 
introduced into the sampling process. This inevitably results in a somewhat 
complicated sampling structure with different sampling fractions for the different 
parts of the sample, which in turn introduces complications into the analysis 
of the results, at least if unbiased estimates are required. 

An example of this type of survey is provided by the Survey of Fertilizer 
Practice, carried out in various counties of England and Wales from 1942 
onwards (Yates et al., 1944, G). The objects of this survey are to determine 
the way in which farmers manure the different crops, and the relation of this 
manurial practice to the fertilizer requirements of the soil, in so far as these 
can be determined by the current methods of chemical soil analysis. 

The method of selecting the samples is as follows. For each county a 
systematic sample of farms is selected from the Ministry of Agriculture's 
addressograph list, maintained for the purpose of collecting the agricultural 
statistics on crop acreages and livestock. This list is arranged alphabetically 
by farmer's name and parish, and shows the total acreages (crops and grass) 
of each farm. A variable sampling fraction is used, with three size-groups, 
about 100 farms being selected from an average county. Larger samples 
are taken from counties which can be subdivided into districts containing 
different types of farming. 

Each selected farm is visited by a field investigator, who is a member of 
the Provincial Advisory Staff. All the fields of the farm are listed in consultation 
with the farmer according to their crops, and also according to whether they 
have been recently ploughed out from grass (new and old arable). In the 
earlier surveys one field of each crop was selected at random from all the 
old-arable fields, and similarly for all the new-arable fields. One permanent 
grass field was also selected. In the later surveys one field in three of each 
crop has been selected from each of these categories. From each group of 
selected fields one old-arable, one new-arable and one permanent grass field 
is selected at random, and soil samples taken for chemical analysis. 

For the selected fields information is obtained from the farmer on the 
cropping over the previous four years, and the amounts and chemical composi- 
tion of the fertilizers, farmyard manure and lime applied in each year of this 
period. In some of the later surveys only a single year has been covered. When 
necessary the fertilizer merchants are consulted in order to obtain information 
on the chemical composition of the fertilizers. 

The methods of analysis adopted in this survey are illustrated in Example 
6.19. 

4.24 Use of maps as frames in agricultural surveys 

If accurate large-scale maps showing the field boundaries are available, 
the point method of sampling is very suitable for crop surveys in which contact 
with the farmer is not necessary. The fields will then act as sampling units, 
and selection will be with probability proportional to size. Provided the whole 
of a selected field is under a single crop, all that is necessary for acreage estimates 
is to ascertain the crop, no determination of area being required (Section 3.9). 
If more than one crop is being grown on a selected field, the proportions of 
the area under the different crops must be determined, but eye estimates will 
usually be adequate for this purpose. 
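
The acreage estimate from the point method amounts to multiplying the total area by the average share of each crop over the sample points. The sketch below assumes that the total land area is known and that each point is recorded as the proportions of its field under each crop; the function name and the records are hypothetical.

```python
def crop_acreages(total_area, point_records):
    """Estimate crop acreages from a point sample.

    point_records: one dict per sample point, mapping crop -> proportion of the
    field containing the point that is under that crop (proportions sum to 1).
    """
    n = len(point_records)
    shares = {}
    for record in point_records:
        for crop, share in record.items():
            shares[crop] = shares.get(crop, 0.0) + share
    return {crop: total_area * s / n for crop, s in shares.items()}

# Hypothetical survey of 1000 acres with four points; the second field is mixed.
estimates = crop_acreages(1000, [
    {"wheat": 1.0},
    {"wheat": 0.5, "barley": 0.5},
    {"grass": 1.0},
    {"grass": 1.0},
])
```

No field areas enter the calculation, because a field's chance of containing a point is already proportional to its area.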

In this type of work two-stage sampling will often be advisable in order 
to save travelling, and also to avoid having to handle an excessive number of 
maps. Thus in the United Kingdom the 6-inch Ordnance Survey 
quarter-sheets (3 miles × 2 miles) might provide suitable first-stage units, a fairly 
dense grid of points being taken over the selected sheets. 

If selection with equal probability of irregularly-shaped areas such as 
fields is required, these areas must each be defined by a single point, such 
as the most northerly point of the area. The map is then divided into sampling 
units consisting of rectangular areas, a number of which are selected with 
equal probability, the fields with defining points in the selected areas being 
included in the sample. Only the selected rectangular areas need be 
demarcated. If more convenient, circular areas whose centres are located 
at random (or systematically) may be used in preference to rectangular 
areas. Rectangular areas have the formal advantage that the whole of the area 
is included once and once only in the aggregate of sampling units, but this is 
not of great practical importance. 
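
The use of defining points can be sketched as below. The example assumes that each field is given as a list of (x, y) boundary vertices with y increasing northwards, and that the map is cut into a regular grid of rectangles; the function names and the rule for breaking ties are hypothetical.

```python
import math
import random

def defining_cell(vertices, cell_w, cell_h, n_cols, n_rows):
    """Grid cell (col, row) containing the field's most northerly vertex."""
    # Largest y wins; ties are broken in favour of the smallest x.
    x, y = max(vertices, key=lambda p: (p[1], -p[0]))
    col = min(int(x // cell_w), n_cols - 1)
    row = min(int(y // cell_h), n_rows - 1)
    return col, row

def select_fields(fields, map_w, map_h, cell_w, cell_h, n_cells, seed=0):
    """Select n_cells rectangles with equal probability and return the fields
    whose defining points fall in them."""
    n_cols = math.ceil(map_w / cell_w)
    n_rows = math.ceil(map_h / cell_h)
    all_cells = [(c, r) for c in range(n_cols) for r in range(n_rows)]
    chosen = set(random.Random(seed).sample(all_cells, n_cells))
    return [name for name, verts in fields.items()
            if defining_cell(verts, cell_w, cell_h, n_cols, n_rows) in chosen]
```

Since every field has exactly one defining point, it belongs to exactly one rectangle and so has the same chance of selection as any other field, whatever its size or shape.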

It should also be noted that with this method of selection the sampling 
units consist of groups of fields whose defining points are included in a single 
rectangular area, and not the fields themselves. Rectangular areas which 
contain no defining points must be counted as units of zero area. It can be 
shown that when the rectangular areas are small and mostly contain one or 
no defining point, the sampling errors of estimates of crop acreages are greater 
with this method of sampling than with the point method. The point method 
is therefore preferable under these circumstances. 

On the other hand, if the rectangular areas are large relative to the sampled 
fields (as will be the case, for example, if whole sheets of a map are surveyed), 
the use of defining points in this manner saves splitting fields which are cut 
by the map boundaries. Some slight additional variance will be introduced 
unless the total area of all fields is determined and used as supplementary 
information. 

Maps have been extensively used as frames for the estimation of the 
acreages of crops in surveys conducted by the Calcutta Institute of Statistics 
(Mahalanobis, 1944, A ; 1946, A ; 1940, H ; 1945, H ; 1946, H). The 
method followed is to demarcate square areas located at random on the maps, 
and to survey all fields covered in whole or in part by these areas. The areas 
of the whole and part fields are determined from the maps by measurement. 
This measurement of areas and their subsequent summation might be avoided 
by the use of point sampling : if each square area were replaced by a square 
pattern of 9 or 16 points, for example, it would appear that the loss of accuracy 
would be small (see Example 8.11.c). 

4.25 The 1942 Census of Woodlands 

The 1942 Census of Woodlands covering England and Wales provides an 
example of the use of maps as a frame. The object of the survey was primarily 
to determine the volumes of standing timber of various types in the country, 
and their broad regional location, in order to estimate the amount of available 
home-grown timber and to plan its utilization. 

The sample was initially planned to be taken in two parts, each consisting 
of 5 per cent. of the total land area. The sampling units were 6-inch Ordnance 
Survey quarter-sheets (3 × 2 miles), systematically located on two 
interpenetrating 12 × 10-mile rectangular grids, one for each part. All areas of 
woodland of over 5 acres on the selected quarter-sheets, and 1 in 5 of the areas 
under 5 acres, were surveyed. Areas of woodland cut by the boundary of the 
map were surveyed if their southernmost point was included in the selected 
map, areas subdivided by rides marked on the map being treated as separate 
areas for this purpose. The land areas covered by the selected maps were 
also inspected in the course of the survey to determine any new plantings since 
the map was last revised, thus correcting for any incompleteness in the frame. 
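
The 5 per cent. figure follows from the geometry: a 3 × 2-mile quarter-sheet covers 6 square miles and a 12 × 10-mile grid cell covers 120 square miles, so that one sheet per cell is 6/120, or 1 in 20. The sketch below reproduces this on a hypothetical rectangular region of 40 × 50 quarter-sheets; the starting positions of the two grids are assumptions made for illustration.

```python
def systematic_sheets(n_cols, n_rows, step_cols, step_rows, start_col, start_row):
    """Positions (col, row) of the sheets taken at regular intervals from a starting sheet."""
    return [(c, r)
            for c in range(start_col, n_cols, step_cols)
            for r in range(start_row, n_rows, step_rows)]

# A 12 x 10-mile grid of 3 x 2-mile sheets means every 4th column and every 5th row.
step_cols, step_rows = 12 // 3, 10 // 2
part_1 = systematic_sheets(40, 50, step_cols, step_rows, start_col=0, start_row=0)
part_2 = systematic_sheets(40, 50, step_cols, step_rows, start_col=2, start_row=2)
fraction = len(part_1) / (40 * 50)
```

Giving the second grid a different starting sheet places its sheets between those of the first, so that the two parts never coincide and each part covers the whole country evenly.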

The woodland areas were divided by inspection on the ground into "stands," 
each of which represented a homogeneous area of woodland. The boundaries 
of these stands were demarcated on the maps so that their areas could be 
determined, and a representative plot was chosen from each stand on which 
all or a sample of the trees were measured. Representative plots were used 
instead of random plots because reasonably accurate volume figures for the 
individual stands were required. Control of the bias introduced by the use 
of representative plots was effected by determining the quantities of converted 
timber actually obtained from surveyed stands felled in the course of ordinary 
forestry operations. This procedure served also as a check against any errors 
in the assumed wastages on conversion. A further control by the measurement 
of randomly selected plots on a sub-sample of stands was also planned, but 
was not in fact carried out. 

The total area of woodland was determined independently from the areas 
coloured green on the 1-inch Ordnance Survey sheets (see Example 7.18). 
Errors in the 1-inch sheets were allowed for by comparing the selected 6-inch 
sheets with the 1-inch sheets after survey. The final estimates of volume 
were calculated from the volumes per acre determined from the surveyed 
areas and the total areas determined as above. 
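
The final step is a simple product of a rate and an independently determined area, as the following sketch shows. The function name and the figures are hypothetical, and the correction of the 1-inch sheets is taken to have been applied already to the total area supplied.

```python
def estimate_total_volume(surveyed_volumes, surveyed_areas, total_area):
    """Total volume = (volume per acre on the surveyed areas) x (total woodland area)."""
    per_acre = sum(surveyed_volumes) / sum(surveyed_areas)
    return per_acre * total_area

# Hypothetical figures: two surveyed areas and a total of 1000 acres of woodland.
volume = estimate_total_volume([100, 300], [10, 30], 1000)
```

In practice the calculation would be made separately for each type of woodland and each region, and the results added.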

A first estimate of volumes was required within six months of the decision 
to undertake the survey, and it was thought that with the teams available the 
first part of the survey could be completed and the estimates prepared within 
this time. Before field work commenced, however, it became apparent that 
the original programme could not be adhered to. Each selected quarter-sheet 
was therefore roughly divided into two halves as similar as possible, and one 
half of each sheet was selected at random for the first part of the survey, giving 
a 2½ per cent. sample of all woodlands in the country. Subsequent calculations 
of the sampling errors showed that this 2½ per cent. sample was quite adequate 
for the determination of general policy, which was the first objective of the 
survey. 

The survey was then completed in two further parts, first the remaining 
halves of the first set of quarter-sheets, and secondly the other set of 
quarter-sheets. The three parts therefore consisted of 2½ per cent., 2½ per cent. and 
5 per cent. respectively of all the woodlands of the country. In addition certain 
heavily wooded areas were completely surveyed. 

The history of this survey demonstrates the extreme flexibility of sampling 
surveys, and the way in which they can be made to yield preliminary results 
of ascertainable reliability. By the procedure of first surveying a properly 
selected quarter of the whole sample, it was possible to obtain the preliminary 
estimates in the required time in spite of unexpected delays in the commence- 
ment of the survey. 

4.26 Frames for undeveloped areas 

If no accurate maps are available, exact location of previously demarcated 
small sample areas on the ground will be impossible. Alternative methods 
must therefore be employed. 

For completely undeveloped areas such as natural forests the line method 
of sampling is very suitable, provided the terrain and vegetation are such that 
the lines can be followed on given compass bearings without an undue amount 
of deviation. Distances along the lines can be determined by some simple 
measuring device such as a rope, or even by pacing. Where volume measure- 
ments are required small areas can be demarcated at given distances along 
each line. 

Some frame for the location of the lines is necessary. This can often be 
provided by existing mapped roads or other tracks, but it is by no means im- 
possible to construct a secondary frame as the survey proceeds by the use 
of cross traverses, using any available tie-in points. Except where maps are 
to be constructed, no great accuracy in the location of the lines is required, 
since it is only necessary that they be located in an unbiased manner with a 
density which is the same for the different parts of the area, or, if not the same, 
is determinable. 

In areas in which a line on a fixed bearing cannot be followed, any attempt 
at complete and unbiased coverage must necessarily be very expensive. 
Often, however, a sufficiently unbiased sample of natural vegetation will be 
obtained by traversing existing tracks and taking sample areas at suitable 
intervals by offsets at right angles to the tracks. If a map of these tracks is not 
previously available it may be worth constructing one by rope and sound or 
similar rough surveying technique. 

Crop surveys in partially developed areas without adequate maps present 
somewhat different problems. If the cultivated areas are located in the neigh- 
bourhood of villages, a two-stage sampling process will probably be required, 
a sample of villages being taken at the first stage. Since the total area of 
cultivated land associated with a village is likely to be closely correlated with 
the population figures, these (if known) should be treated as supplementary 
information. If not known the feasibility of making a simultaneous population 
census should be considered, since information on cultivated areas will be 
of more value if it can be related to population figures. In this case the sampling 
may well be two-phase, a larger sample being taken for the determination 
of population. 
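
A two-phase estimate of this kind might be calculated as in the sketch below, which uses a ratio estimate: the larger sample supplies the average village population, and the smaller sample supplies the cultivated area per head. The function name, the argument layout, and the figures are hypothetical, and the ratio method is only one of the ways in which supplementary information can be used.

```python
def two_phase_ratio_total(n_villages, phase1_populations, phase2_pairs):
    """Estimate the total cultivated area of all villages.

    n_villages:         number of villages in the whole area.
    phase1_populations: populations of the villages in the larger (first-phase) sample.
    phase2_pairs:       (population, cultivated area) for the villages in the sub-sample.
    """
    mean_population = sum(phase1_populations) / len(phase1_populations)
    area = sum(a for _, a in phase2_pairs)
    population = sum(p for p, _ in phase2_pairs)
    # Area per head from the sub-sample, scaled up by the estimated total population.
    return n_villages * mean_population * area / population

# Hypothetical figures: 100 villages, 4 counted for population, 2 surveyed for area.
total_area = two_phase_ratio_total(100, [200, 400, 600, 800], [(200, 100), (600, 300)])
```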

The survey of the cultivated areas associated with the selected villages 
will require the construction of second-stage frames. If the line or point 
method of sampling is practicable this is likely to be the simplest method of 
dealing with compact areas of cultivation. Outlying fields will in this case 
have to be enumerated and sampled separately. 

In many cases enumeration of all fields will be the only practicable method. 
The preparation of a sketch map will then be advisable. A certain percentage 
of the enumerated fields can be measured for area, and the crops determined 
if this has not been possible at the mapping stage. If the cropping is known, 
stratification by crop should be made before the selection of the sample for 
area measurements. A frame of this kind may remain serviceable, with some 
revision, over a number of years. It will also serve to locate the samples 
required in a crop estimation scheme. 

4.27 Use of aerial survey photographs 

When no maps are available the possibility of using aerial survey
photographs as a frame for agricultural and land utilization surveys should be borne
in mind. Although it is unlikely to be practicable to make an aerial survey 
simply for the purpose of providing a frame for sample surveys, it is often 
possible to utilize a survey that has been undertaken or is contemplated for 
other purposes. 

Any aerial photographs covering the area are likely to provide an adequate 
frame, though the use of aerial survey photographs even for a frame is not 
as simple as it appears at first sight. The mere handling of the photographs 
covering any large area is a somewhat difficult task which demands an adequate 
and properly trained office staff. Moreover, aerial photographs are subject 
to variations of scale (and also distortion) due to tilt and changes of altitude 
of the aircraft. The stated scale is therefore not always correct, and the scale 
sometimes exhibits disconcerting variations even over different parts of the 
same mosaic. The precaution should therefore be taken of checking the scale 
by means of measurements on the ground in a sufficient number of instances 
to make certain that no important source of error is introduced. 

Various methods of sampling can be used in conjunction with aerial 
photographs. If crop acreages have to be determined, point sampling is 
suitable. After the points have been marked on the photographs the fields 
in which these points fall must be identified on the ground and the crops 
growing on them recorded. In order to avoid excessive travelling it will almost 
certainly be worth using a two-stage process, the units at the first stage being 
rectangular areas which can be demarcated on the photographs, with a number 
of points taken within each of the selected areas. 

If line sampling is required, the lines can first be demarcated on the 
photographs, and subsequently surveyed on the ground. In certain 
circumstances it may be possible to make the intercept measurements on the 
photographs, using the ground survey merely to determine the characteristics 
of the various intercepts.

If areas such as fields, the boundaries of which are recognizable on the
photographs, are to constitute the sampling units, they may be selected with 
probabilities proportional to their sizes by point sampling. If there are likely 
to be ambiguities in the definition of the boundaries the units should be 
demarcated before selection. 
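
The reason point sampling selects fields with probabilities proportional to their sizes can be illustrated by a small simulation. The following Python sketch is only an illustration under simplifying assumptions: the fields are represented as hypothetical non-overlapping rectangles in photograph coordinates, and the function name and the dimensions are invented for the example.

```python
import random

def select_fields_by_points(fields, width, height, n_points, seed=1):
    """Select fields by dropping random points on a photograph.

    fields: dict mapping a field name to (x0, y0, x1, y1), hypothetical
    non-overlapping rectangles in photograph coordinates.
    Returns the list of fields hit, one entry per successful point.
    """
    rng = random.Random(seed)
    selected = []
    while len(selected) < n_points:
        x, y = rng.uniform(0, width), rng.uniform(0, height)
        for name, (x0, y0, x1, y1) in fields.items():
            if x0 <= x < x1 and y0 <= y < y1:
                selected.append(name)  # chance of a hit is proportional to area
                break
        # a point falling outside every field is discarded and another is drawn
    return selected

# Two hypothetical fields: A has three times the area of B.
fields = {"A": (0, 0, 30, 10), "B": (30, 0, 40, 10)}
hits = select_fields_by_points(fields, width=50, height=10, n_points=4000)
share_a = hits.count("A") / len(hits)  # expected to be close to 0.75
```

In this sketch a field may be hit by more than one point, so the selection is effectively with replacement.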

If natural units such as farmhouses which depend on point locations are 
to be selected, small rectangular or circular areas may be used as sampling 
units in the same manner as in selection from a map. 

In certain cases aerial photographs may provide the necessary information 
without any ground survey work. It is usually possible, for example, to 
recognize cultivated areas on the photographs, and the total cultivated area 
may consequently be determined directly from the photographs. In certain 
cases it may even be possible to differentiate between the different crops. In 
these cases the total cultivated area and the proportions of the area under the 
different crops can be determined by sampling of the photographs, point or 
line sampling being used as convenient. If desired, adjustments for variations 
in scale can be made by varying the spacing of the points or lines. 

In some cases the differentiation between the different crops on the 
photographs may be only partial, or subject to error. In such cases a sub-sample 
of the points classified on the photographs can be re-classified by ground 
survey. The information provided by the photographic classification will then 
serve as supplementary information. By this procedure the amount of ground 
survey necessary may be very considerably reduced. The examination of 
stereo-pairs may be a considerable aid to the classification of certain types of 
area, particularly forest areas. 

If an aerial survey is specially undertaken for the purpose of a sample
census or survey, it is possible to reduce the amount of photography by taking 
parallel strips of photographs separated by unphotographed areas, but aerial 
photographs taken in this manner will not be of much use for mapping purposes. 
If no map frame is available, a few cross-strips will have to be taken to provide 
links between the separate strips. Too much reliance must not be placed on 
the accuracy of the location of the strips unless special navigational aids are 
installed. 

4.28 Crop estimation 

The total yield of a crop can be regarded as the product of its acreage and 
the mean yield per acre. These two quantities may therefore be estimated 
separately. Estimates and forecasts of the mean yield per acre must of course 
be related to the conventions adopted in the estimation of acreage, particularly 
with regard to areas on which the crop has failed or been abandoned. 

The estimation of acreage has already been discussed in the preceding 
sections, and in this section we shall therefore mainly be concerned with the 
problem of the estimation of the mean yield per acre. 

There are a number of ways in which estimates of the mean yield per acre
of a crop, or the total yield, may be obtained. These may be broadly classified
as follows : 

(1) Reports from crop-reporters, who, at or subsequent to harvest, make 
returns to a central authority of their estimates of the average yields 
of the crop in their own districts, these estimates being based in the 
main on general impressions, discussions with farmers, etc. 

(2) The harvesting of small sample areas of the crop immediately prior 
to the main harvest. 

(3) Eye estimates of the yields of a sample of fields, with subsequent 
calibration of these eye estimates by comparison with the actual yields 
of some at least of the sample fields. 

(4) Co-operation with the farmers at harvest time so that accurate yield 
figures may be obtained from a sample of fields as they are harvested. 

(5) Returns by farmers of the yields of their crops. 

(6) Market returns, export statistics, etc. 

If necessary, Methods (2) and (3) may be combined in a two-phase sampling 
scheme, eye estimates being taken from a comparatively large sample of fields, 
with crop-cutting samples from a smaller sub-sample of these fields. 
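
One common way of combining the two phases is a regression adjustment, sketched below in Python. This is an assumption made for illustration: the text speaks only of calibrating the eye estimates and does not prescribe a formula at this point. The function name and the figures in the usage lines are hypothetical.

```python
def two_phase_regression_estimate(eye_all, eye_sub, cut_sub):
    """Estimate mean yield per acre from a two-phase sample.

    eye_all: eye estimates for every field in the large first-phase sample.
    eye_sub: eye estimates for the fields in the crop-cutting sub-sample.
    cut_sub: crop-cutting yields for the same sub-sample fields, in the same order.
    """
    n = len(eye_sub)
    mean_eye_all = sum(eye_all) / len(eye_all)
    mean_eye_sub = sum(eye_sub) / n
    mean_cut_sub = sum(cut_sub) / n
    # Slope of the regression of cut yield on eye estimate in the sub-sample.
    sxy = sum((x - mean_eye_sub) * (y - mean_cut_sub)
              for x, y in zip(eye_sub, cut_sub))
    sxx = sum((x - mean_eye_sub) ** 2 for x in eye_sub)
    slope = sxy / sxx
    # Adjust the sub-sample mean for the difference in mean eye estimate.
    return mean_cut_sub + slope * (mean_eye_all - mean_eye_sub)

# Hypothetical figures in which cut yield = 2 x eye estimate + 1 exactly.
estimate = two_phase_regression_estimate(
    eye_all=[10, 12, 14, 16, 18, 20],
    eye_sub=[10, 14, 20],
    cut_sub=[21, 29, 41],
)  # close to 31, i.e. 2 x 15 + 1
```

The eye estimates need not be unbiased for this to work; they need only be well correlated with the yields obtained by crop cutting.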

These various methods all have their advantages and disadvantages. 
Method (1), that of crop-reporters, is the one commonly adopted by countries 
with long-established and stable systems of agriculture. Its success depends
on the ability of the individual crop-reporters to make accurate and unbiased 
estimates of the average yields of their districts. The method is not objective, 
and no assessment of its accuracy can be made unless independent estimates, 
provided by some other method of known or ascertainable accuracy, are 
available for comparison. Doubt is often cast on estimates provided by the 
method because of disagreement with market returns, etc., and their lack of 
objectivity makes it impossible to say which set of estimates is at fault. 

Even if crop-reporters are reasonably accurate on the average over a run 
of years, estimates for particular years or particular districts may be subject 
to considerable errors. There seems to be a general tendency, for instance, 
to underestimate yields in good years and overestimate them in bad years. 
The accuracy attained may also be very different for the different crops. 
Moreover, spurious long-term trends may be introduced through gradual 
changes in the standards of the reporters, and this considerably reduces the 
value of the estimates as a measure of the improvement or deterioration of the 
agriculture of a country. Any sudden change in an agricultural system, such 
as the introduction of new varieties, or the bringing into cultivation of new 
land, may introduce disturbances into previously satisfactory estimates. 

Method (2), the harvesting of small sample areas, is theoretically capable of 
providing a completely objective estimate of the mean yield per acre of the 
standing crop at harvest time ; it will not, of itself, provide any estimate of 
the losses at or subsequent to harvest. In practice, however, serious bias
may arise in a number of ways if proper precautions are not taken. These
sources of bias, and the practical details of the method, are discussed further 
in the next section. 

Method (3), that of eye estimates, has the advantage that on certain types 
of crop such estimates can be relatively rapidly made, and consequently a larger 
sample of fields can be visited in a given time. The difficulty of having to 
transport and thresh a large number of samples, which often arises with 
Method (2), is also avoided. Some results of a trial of this method on wheat
are given in Example 6.15. The method is not suitable for root crops such 
as sugar beet and potatoes, since it is difficult to judge the yields from inspection 
of the tops. In such crops, however, there are no transport and threshing 
problems, since the samples can be weighed in the field. 

If calibration of the eye estimates from the farmers' yields is not practicable,
or if the calibration is found to vary substantially from year to year, a few
specially trained field workers can be used to take crop-cutting samples in
order to calibrate the eye estimates of each investigator at the time of harvest. 

Methods (4) and (5) require the co-operation of the farmer. Method (4) 
differs from Method (5) in that in Method (4) the harvesting is done in the 
presence of an investigator, and if necessary with assistance, such as the 
provision of a threshing machine, whereas in Method (5) reliance is placed 
entirely on the farmer to provide accurate yield figures. Owing to delays of 
threshing, etc., Method (5) is not likely to provide estimates till some time 
after harvest. 

Estimates from market returns, export statistics, etc. (Method 6) provide 
a useful basis for comparison with estimates by other methods, but such returns 
will only exceptionally give an accurate estimate of the actual yields, since 
the amount of the crop passing through the market is likely to vary very
considerably in different circumstances. 

In Methods (2) and (3), which require field investigators, the question must 
be considered whether the survey should cover the whole of the country or 
whether it should be confined to certain districts only, using a two-stage 
sampling process. If only an estimate of the yield of the whole country or 
of large districts is required, comparatively few fields will need to be sampled, 
and a single-stage process for the selection of fields will result in a very 
dispersed sample. A two-stage process will avoid this difficulty at the cost of 
introducing a between-districts component of variation into the sampling error.

Finally, it should be emphasized that crop estimation, though theoretically
simple, presents many practical difficulties. The introduction of a satisfactory 
scheme where none exists, or the provision of objective estimates to check 
existing subjective estimates, requires continuous work over a number of years 
by a properly established team of workers. Except for preliminary investigations, 
crop-estimation projects should therefore not be undertaken unless continuity 
can be maintained. Nor should an existing method of estimation be abandoned 
or disturbed until a better alternative has been evolved and kept in operation 
for some time. If the old and new methods are run in parallel for a number
of years it will be possible to assess the reliability of the old method and its 
degree of bias, a task which will be quite impossible if it is discontinued before 
adequate comparative data have been obtained. 

4.29 Estimation of yield by the harvesting of sample areas 

As mentioned in the previous section, the estimation of the mean yield 
per acre of an agricultural crop by the harvesting of small sample areas presents 
many practical difficulties, and the results may be biased in a number of ways 
if the proper precautions are not taken. 

Errors can occur through faulty selection of the fields, through faulty 
sampling of the selected fields, through failure to take samples from the fields
at dates sufficiently near harvest, or through failure to sample some of the 
selected fields owing to their having been harvested before they were 
visited. 

If rigorous means of selection are employed there is no reason why the 
selection of the fields should be faulty. If, however, the cruise method is used, 
fields being taken at equal distances along a given route in the manner described 
in Section 3.15, the estimate will almost certainly be appreciably biased, 
though this bias may be reasonably constant from year to year if the same 
route is followed each year. On the other hand, the use of the cruise method 
overcomes the difficulty of ensuring that the visits to the fields are made 
sufficiently near harvest, and also eliminates the risk of missing fields through 
their having already been harvested. All that is necessary is to traverse the 
route at sufficiently close intervals of time, stopping the car at each sample 
point and examining the crop to see if it has reached a sufficiently mature stage 
for a sample to be taken. 

An alternative procedure which is sometimes used is to follow the prescribed 
route and take a sample from all or a given fraction of the fields that are actually 
being harvested. This, however, may introduce an additional component of 
bias, since, unless special precautions are taken, limitations of time will result 
in the inclusion of a greater proportion of the fields which are harvested very 
early or very late. 

With crops that do not have to be fully mature at harvest, e.g. potatoes, 
samples will normally have to be taken somewhat before maturity, unless 
information is available from the farmers as to when they intend to lift. With 
such crops, however, it is usually possible to estimate the amount of growth 
between the time of taking the sample and date of harvest ; this latter date 
can then be determined by a subsequent visit. In the potato crop, for example, 
investigation has shown that the weight of tops provides a fair indication of 
the amount of further growth that may be expected. 

The cruise method of sampling, therefore, provides a method of crop 
estimation which, though theoretically more liable to bias than a proper random 
selection of fields, may in practice give more satisfactory results, particularly 
in the estimation of yields per acre. It is also likely to be considerably more
economical in travel. Which method is most suitable will depend largely on
local conditions, and must be the subject of local investigation. 

Bias in the estimation of the yields of the actual fields can arise from 
improper location of the samples and from cutting a larger area of the crop 
than the true unit area. An example of such bias has already been given in 
Section 2.5. 

Edge effects are also liable to give rise to bias, since in an irregularly shaped 
field it is impossible without a great deal of labour to locate samples properly 
at random over the whole of the area. The method described in Example 3.2.b
is clearly impracticable, and no simple method of traversing the field has been 
devised which will give equal probability of selection over the whole field. 
In practice, however, a systematic method of selecting the sample is quite 
adequate. The important thing is to see that the location of the sample units 
is as objective as possible. 

The determination of the bias arising from headlands, lower yields at the
edges of the field, and errors in estimation of the area of the field (the U.K.
Ordnance Survey areas, for example, include farm roads, hedges and ditches)
can be made if required by more rigorous supplementary observations on a
small percentage of the fields. Often, however, the separate determination of 
these components of bias is of no great practical interest, since the losses at 
and after harvest will also affect the total amount of the crop that is finally 
available for consumption, and the total bias is best determined by comparison 
with the farmers' reported yields on a sub-sample of the fields, or by determining
these yields in co-operation with the farmer. 

Two methods of locating the sample units have been found convenient 
in practice in this country. The first is to traverse the field diagonally from 
corner to corner, using one or both diagonals, and locating samples at equal 
intervals along these diagonal lines. The interval required can be calculated 
by pacing the diagonal or making an eye estimate of the number of paces. 
Errors in the eye estimates are of little consequence, since the exact number 
of sampling units is immaterial. Alternatively, if the crop is in rows the field 
can be traversed along the rows. The length of one end is paced from corner A 
to corner B, the field being entered at a distance of one-quarter of this length 
from corner B. It is then traversed along this row to the other end of the 
field, and a return row is selected by the same procedure. A suitable number 
of sampling units is taken at each traverse in the same manner as in the case 
of a diagonal traverse. In this method of sampling it is advisable to step 
laterally across a given small number of rows after each sampling unit has 
been taken, since a given row may fall wholly on a particularly good or bad 
strip of the field, e.g. on a ploughman's "land."
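
The spacing of the sampling units along a diagonal traverse can be worked out as in the short Python sketch below. The rule used here, dividing the traverse into n + 1 equal intervals so that no unit falls at either corner, is an assumption made for the illustration, as are the function name and the figures.

```python
def sample_positions(diagonal_paces, n_units):
    """Paces from the starting corner at which each sampling unit is taken."""
    interval = diagonal_paces / (n_units + 1)  # assumed spacing rule
    return [round(interval * k) for k in range(1, n_units + 1)]

# A diagonal paced (or judged by eye) at 200 paces, with 4 units wanted.
positions = sample_positions(diagonal_paces=200, n_units=4)
# positions == [40, 80, 120, 160]
```

Because the positions are fixed by counting paces before the crop is examined, the field worker has no opportunity to choose the spots by eye.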

Whatever the exact method of traversing the field, it is of the utmost 
importance that the location of the rows and of the sampling units should 
be made without inspection of the crop in the neighbourhood. This can be 
done quite effectively by counting paces and digging in the heel when the
requisite number of paces have been taken, but the field workers must be
thoroughly trained in this procedure. 

A good deal of work has been done on the most suitable size and shape of
the sampling units. In this country experimental tests have shown that 4-6
contiguous quarter-metre row lengths are suitable for cereal crops, with 6-10
units per field ; for potatoes 4 units each of 6 ft. of row are now being tested
on a large scale. Mahalanobis (1946, A), working in India, has used three or
four concentric circles of 2-8 ft. radius, each annular ring being harvested
separately so as to provide a control of field workers (with ordinary workers
a bias is regularly found with the smallest circle), and this gives a check that
the samples have really been taken in the prescribed manner.

The best size and shape of the sampling units depends very much on the 
nature of the crop and local conditions, such as type of field worker, whether 
the crop is sown or planted in rows or broadcast, variability within fields, 
available equipment for threshing and transport, etc. Local investigation should 
therefore always be undertaken if any extensive work is contemplated. On 
the other hand, it should be recognized that the sampling error of individual 
fields is usually small relative to the variation from field to field, and consequently 
the introduction of a crop-estimation scheme need not await the results of such 
investigation ; any reasonably efficient method will give satisfactory results 
provided bias is avoided. (See Section 8.12.) 

4.30 Crop forecasting 

The term forecast is here used to denote an estimate of the yield of a crop 
furnished at some date well before harvest. The term is sometimes used to 
indicate estimates made by crop reporters at or even shortly after harvest, 
since such estimates are usually subject to later revision in the light of information 
received from farmers. Such estimates, however, are better termed preliminary
estimates, in contrast to the revised or final estimates. 

There is some confusion also between forecasts and estimates of acreages, 
forecasts of mean yields per acre, and forecasts of total yields. Once the crop 
is sown the determination of the acreage, apart from crop failures, is a matter 
of estimation and not of forecasting, and any forecast of the total yield is usually 
best presented in the form of an estimate of the total acreage and a forecast 
of the mean yield per acre. 

There are three main methods of crop forecasting. Forecasts can be 
provided by crop reporters, they can be based on meteorological data, such as 
rainfall obtained prior to the date of the forecast, or they can be based on 
observations and physical measurements of the growing crop, alone or in 
conjunction with meteorological data. 

Meteorological data do not directly provide forecasts of the yields. If they
are to be used as a basis of such forecasts, reliable data on both the yields and 
the meteorological events must be collected over a number of years, and the 
crop-weather relations evaluated. The same is true if observations and
measurements on the growing crop are to be used. The evaluation of these 
relations requires the application of the method of statistical analysis known 
as " regression analysis." We shall not describe this method here, but it may 
be well to emphasize that its application is not entirely simple, and the advice 
of a mathematical statistician experienced in this type of work should therefore 
be sought. 

It must not be assumed that it will be possible to evolve a prediction formula 
which will give satisfactory results, even if accurate and extensive data 
are available. In the first place, the yield of a given crop is influenced by 
meteorological and other events up to and sometimes after harvest, and this 
may introduce too great a degree of uncertainty into yields predicted some 
months before harvest to make the prediction of any value. In the second place, 
although meteorological factors undoubtedly account for a good deal of the 
variation in crop yields, they are not by any means the only factors. Changes 
in variety, insect pests, plant diseases, exhaustion of the fertility of the soil, 
changes in the type of land under crop, changes in the amount of fertilizers, 
and many other factors may also exert a major influence. Thirdly, meteorological 
effects are often somewhat complex, and it may therefore be impossible to 
determine them from a set of data extending over a limited number of years ; 
owing to the similarity of weather conditions over large areas, data from a 
number of districts in any one year are only a partial substitute for data extending 
over a number of years. 

One advantage of using measurements of crop growth instead of relying 
wholly on meteorological observations is that the crop is thereby used as its 
own integrator of meteorological and other effects up to the time of the 
measurements. Frost and flood damage, for instance, are clearly better assessed, 
once they have occurred, by survey of the crop than by examination of 
meteorological records. The selection of the particular types of observations 
and measurements which are likely to give an adequate basis for forecasting 
is a problem on which further scientific research is required, particularly in 
the case of grain and other seed crops. In the case of root crops investigation 
has already shown that the amount of growth made by the tubers or roots, 
coupled with some measure of the extent to which the plant is still growing, 
e.g. weight of tops, are likely to give satisfactory results. 

Since the evolution of a satisfactory method of crop forecasting demands 
a knowledge of the actual yields over a period of years, an investigation of 
suitable methods can be combined with an objective crop-estimation scheme. 
Once the observations and physical measurements which are likely to give 
useful information have been decided, all that is necessary is to take these 
measurements on a sub-sample of the fields which will subsequently be selected 
for sampling at harvest. In the initial stages it may be better to carry out the 
observations on a special sample of fields, rather than on the more scattered 
sample which will be suitable for crop estimation proper. More intensive 
investigations can also be carried out on experimental plots on which different 
varieties are sown, and which are subject to different cultural treatments and
sowing dates. Experimental plots by themselves, however, are not likely to
provide all the information required for the evolution of a suitable forecasting 
scheme, since the variation from field to field in a given district is often quite 
large, and the inclusion of a number of fields in the district usually gives a 
much more adequate representation of the average meteorological effects in 
that district than will a single field. 

If observations and measurements are to be made on the growing crop, 
a sampling scheme will have to be devised in order that single plants or small 
areas may be selected for measurement. A method of selecting wheat shoots 
for height measurements, for example, is described in Section 2.4. The
principles to be followed in the location of the sampling units in the fields 
or experimental plots are similar to those which operate in the selection of
samples for yield estimates. 

4.31 Determination of the size of sample when the sample is fully random

As has been indicated in Chapter 3, the size of sample required to achieve 
a given accuracy depends on the variability of the material and the extent 
to which it is possible to eliminate the different components of this variability 
from the sampling error. In this and the following section we will describe 
the procedure which is appropriate for determining the size of a random 
sample, and indicate the general relationship between the errors of a random 
sample and other types of sample. Detailed consideration of the more involved 
types of sampling must be deferred till Chapter 8, where the comparative 
accuracy of the various types of sampling is discussed. 

In the discussion of sample size we shall require the concept of standard 
error. As already explained in Section 3.7, the sampling standard error of 
an estimate is a measure of the average magnitude of the random sampling 
error to be expected in that estimate. It also provides an indication of the 
frequency with which errors of various magnitudes may be expected to occur 
(Section 7.3). In rough general terms, one-third of the actual sampling errors
will be greater than the standard error, and one-twentieth will be greater than 
twice the standard error. 
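
These rough fractions can be checked numerically if the sampling errors are assumed to follow a normal distribution. That assumption, and the function name, are introduced here only for the illustration. A minimal Python sketch:

```python
import math

def fraction_exceeding(k):
    """Two-sided normal tail: chance that an error exceeds k standard errors."""
    return math.erfc(k / math.sqrt(2))

beyond_one = fraction_exceeding(1)  # about 0.317, roughly one-third
beyond_two = fraction_exceeding(2)  # about 0.046, roughly one-twentieth
```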

In the case of a fully random sample from a large population the formula 
for the standard error of the estimate of the proportion of units of a given 
type, i.e. having a given attribute, is very simple.* If p is the proportion of
units of the given type in the whole population, and q = 1 - p is the proportion
not of the given type, the standard error of the proportion of units of the
given type in a random sample of n units (which provides an estimate $\hat{p}$ of p) is
given by

$$\text{standard error of } \hat{p} = \sqrt{\frac{pq}{n}}$$


* The adjustment for finite sampling, required when the population is not large 
relative to the sample, is given in Section 8.1.

The formula holds unchanged if the proportions are replaced by percentages.
Thus

$$\text{standard error of } (\hat{p}\%) = \sqrt{\frac{(p\%)(q\%)}{n}}$$

The full line of Fig. 7.4 provides a graphical representation of the standard
errors given by this formula. If 20 per cent. of the units of the population
are of the given type, for example, the standard error of the percentage of
units in a sample of 100 is

$$\sqrt{\frac{20 \times 80}{100}} = 4,$$

which is the value given by the full line. Thus estimates in the range
20 ± 4 per cent., i.e. between 16 and 24 per cent., will be obtained in two-thirds
of all samples of 100 units, and estimates in the range 20 ± (2 × 4) per cent.,
i.e. between 12 and 28 per cent., will be obtained in nineteen-twentieths of all
samples. If a sample of 1000 is taken, the standard error will be 1.26, and
estimates between 18.7 and 21.3 per cent. will be obtained in two-thirds of
the samples (Section 7.4).
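
The arithmetic of this example can be reproduced with a few lines of Python. The sketch assumes a fully random sample from a large population, as in the text; the function name is chosen here for illustration.

```python
import math

def se_of_percentage(p_pct, n):
    """Standard error of a sample percentage for a random sample of n units."""
    q_pct = 100.0 - p_pct
    return math.sqrt(p_pct * q_pct / n)

se_100 = se_of_percentage(20, 100)    # 4.0
se_1000 = se_of_percentage(20, 1000)  # about 1.26
two_thirds_range = (20 - se_1000, 20 + se_1000)  # about (18.7, 21.3)
```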

When dealing with estimates of quantities such as means and totals it is 
best for our present purpose to work in terms of the percentage standard 
errors. The percentage standard error of the estimate of a quantity is the 
standard error of the estimate expressed as a percentage of the true value of 
the quantity. 

The percentage standard error of the estimate of the total number of units 
having the given attribute in the population is the same as the percentage 
standard error of p. From the above formula we see that this percentage 
standard error is given by 

$$\text{percentage standard error} = \frac{100}{p}\sqrt{\frac{pq}{n}} = 100\sqrt{\frac{q}{pn}}$$

The percentage standard errors given by this formula are shown by the dotted 
line in Fig. 7.4. 

In a population in which 20 per cent. of the units possess the given attribute,
for instance, the percentage standard error with a sample of 100 is*

$$100\sqrt{\frac{80}{20 \times 100}} = 20 \text{ per cent.}$$

If there are 10,000 units in the whole population, 2000 will possess the given 
attribute. The standard error of an estimate of this number from a sample 
of 100 will therefore be 2000 × 20/100 = 400, i.e. in two-thirds of the samples
estimates between 1600 and 2400 will be obtained. 
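
The same figures can be obtained with the short Python sketch below, which again assumes a random sample from a large population. The function and variable names are illustrative only.

```python
import math

def percentage_se_of_proportion(p_pct, n):
    """Percentage standard error of an estimated proportion or total."""
    q_pct = 100.0 - p_pct
    return 100.0 * math.sqrt(q_pct / (p_pct * n))

pct_se = percentage_se_of_proportion(20, 100)      # about 20 per cent.
units_with_attribute = 10_000 * 20 / 100           # 2000 units
se_of_total = units_with_attribute * pct_se / 100  # about 400 units
```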

* This result can also be obtained directly from the actual standard error of the 
percentage, which is 100 × 4/20, i.e. 20 per cent.

The formula may be re-written so as to give the number n required for
the sample when the required standard error of $\hat{p}$ or the required percentage
standard error is known. We have

$$n = \frac{pq}{(\text{required standard error of } \hat{p})^2} = \frac{10{,}000\, q}{p\,(\text{required percentage standard error})^2}$$



The proportions p, q and $\hat{p}$ may be replaced by the corresponding percentages
without change. 

If, for example, we are sampling a population in which it is believed that
about 20 per cent. of the units are of a given type, and it is required to determine
this percentage with a standard error of 1 per cent. (i.e. a percentage standard
error of 5 per cent.), we shall require to take a sample with a number of units
given by

$$n = \frac{20 \times 80}{1^2} = \frac{10{,}000 \times 80}{20 \times 5^2} = 1600$$
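
Both forms of the sample-size formula can be checked against this example with the following Python sketch. It assumes a random sample from a large population, and the function names are chosen here for illustration.

```python
def n_from_standard_error(p_pct, required_se_pct):
    """Sample size from the required standard error of the percentage."""
    q_pct = 100.0 - p_pct
    return p_pct * q_pct / required_se_pct ** 2

def n_from_percentage_se(p_pct, required_pct_se):
    """Sample size from the required percentage standard error."""
    q_pct = 100.0 - p_pct
    return 10_000 * q_pct / (p_pct * required_pct_se ** 2)

n_a = n_from_standard_error(20, 1)  # 1600.0
n_b = n_from_percentage_se(20, 5)   # 1600.0
```

The two functions agree because a standard error of 1 per cent. on an expected 20 per cent. is the same requirement as a percentage standard error of 5 per cent.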



These formulae hold only for a random sample in which the sampling units 
are the units for which the proportion having a given attribute requires to be 
estimated. In sampling a human population, for example, the sampling unit 
may be the household and not the individual. In this case the standard errors 
of proportions of individuals determined from the sample will be larger
(often substantially larger) than those given by putting n equal to the number
of individuals in the above formulae (Section 7.7 and Example 7.8.b). 

The corresponding formulas for a quantitative character are equally simple. 
We then have for the estimate of the mean or total from a random sample 
of n from a large population 

$$\text{percentage standard error} = \frac{\text{percentage standard deviation of a unit}}{\sqrt{n}}$$

$$n = \frac{(\text{percentage standard deviation of a unit})^2}{(\text{required percentage standard error})^2}$$

In order to use these formulae we require an estimate s of the standard 
deviation (or standard error) of a unit in percentage terms. The standard 
deviation of a unit is the measure of the variability of the units amongst 
themselves, and can be determined from a frequency distribution of the unit 
values. Table 7.1 provides an example of such a frequency distribution, and 
Example 7.1.b gives the method of calculating the standard deviation from 
this distribution. For these data (family incomes) s² = 276,290 and therefore 
s = 526. The value of the mean is 1629.1, and the percentage standard 
deviation is therefore 100 × 526/1629.1 = 32.3. The number required in 
the sample to give a standard error of 5 per cent. in the estimate of the mean 
income per family is therefore 

$$n = \frac{32.3^2}{5^2} = 42.$$

For a standard error of 1 per cent. the number required would be 1050. 

A similar calculation for the wheat acreages of the random sample of 
Hertfordshire farms (Table 7.2), already described in Section 3.7, is given in 
Example 7.2.b. In this case s = √1351.4 = 36.8. The mean wheat acreage 
per farm is 18.6, and the percentage standard deviation is therefore 
100 × 36.8/18.6 = 198. The percentage standard deviation is here very large 
because there are a large number of farms growing little or no wheat. In order 
to determine the total wheat acreage of an area with a 5 per cent. standard 
error from a random sample of farms, therefore, we shall require 
198²/5² = 1570 farms. 
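
The two calculations for a quantitative character may be sketched in the same way. The function names are again illustrative and not part of the text; the inputs are the values of s² and of the mean quoted above for the family incomes and for the wheat acreages.

```python
import math


def percentage_sd(s, mean):
    """Standard deviation of a unit expressed as a percentage of the mean."""
    return 100.0 * s / mean


def n_required(pct_sd, required_pct_se):
    """Number of units needed in a random sample from a large population."""
    return (pct_sd / required_pct_se) ** 2


# Family incomes: s^2 = 276,290, mean = 1629.1
income_sd = percentage_sd(math.sqrt(276_290), 1629.1)
# Wheat acreages: s^2 = 1351.4, mean = 18.6
wheat_sd = percentage_sd(math.sqrt(1351.4), 18.6)

for label, sd in (("incomes", income_sd), ("wheat", wheat_sd)):
    print(label, round(sd, 1), round(n_required(sd, 5)))
```

Because the sketch carries unrounded intermediate values, its results may differ slightly from the figures in the text, which are computed from the rounded percentage standard deviations 32.3 and 198.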

From the above formulas we see that the standard errors of estimates 
derived from random samples of different sizes taken from the same population 
are inversely proportional to the square roots of the numbers in the samples. 
Conversely, to reduce the standard errors of the results in a given ratio we 
require to increase the size of the sample by the square of the ratio. Thus, 
in order to halve the standard errors of the results we must multiply the size 
of the sample by 4. 

4.32 Some general rules on size of sample 

From the above discussion it will be seen that the calculation of the size 
of sample required to attain a given accuracy is a relatively simple matter 
when a random sample is taken. With the more involved types of sampling 
the calculations are more complicated, and more must be known of the material 
that is being sampled. 

Calculation of the accuracy which would be attained by a random sample 
is, however, often a useful preliminary guide to the size of sample likely to be 
required in the more involved types of sampling. If only one type of sampling 
unit is under consideration, the reduction in numbers of units required with 
the more complicated types of sampling is determined by the fraction of the 
total variability which is removed by the imposition of restrictions such as 
stratification or by the use of supplementary information. It is frequently 
possible to form a rough idea of the likely reduction from a general knowledge 
of the characteristics of the material. Thus in a survey designed to determine 
crop acreages, using farms as sampling units, it is to be expected that stratification 
by size of farm and the use of a variable sampling fraction in conjunction with 
such stratification will each give considerable increase in accuracy over a random 
sample. This is confirmed by the results already given in Section 3.7. 

When sampling units of different types or of alternative sizes are under 
consideration, the situation is more complicated, as is shown by the results 
already presented in Section 3.11. This is true also of multi-stage sampling. 

The following general rules may be of value. Rules 1 to 5 are applicable 
to the case in which only one type of sampling unit is under consideration, 
rules 6 and 7 to the case in which more than one type of sampling unit is being 
considered. 

(1) The use of stratification, a variable sampling fraction or supplementary 
information may in general be expected to increase the accuracy. 
Consequently the calculation of the number of units required in the 
case of a random sample gives an upper limit to the number of units 
required in any reasonable form of sampling using the same sampling 
units. 

(2) Stratification will only increase the accuracy substantially if there are 
marked differences between the different strata. The increases are 
usually larger for quantitative characters than for qualitative characters, 
i.e. attributes (Table 3.7.b). 

(3) A variable sampling fraction can greatly increase the accuracy when the 
units vary greatly in size, or more generally in variability from stratum 
to stratum. Fractions which increase the accuracy for quantitative 
characters may reduce it for qualitative characters (Table 3.7.b). 

(4) The use of supplementary information can greatly increase the accuracy 
in appropriate cases, and often serves as an alternative to stratification 
(Table 3.7.b). 

(5) Since there must be at least one unit per stratum, more detailed 
stratification is possible with larger samples. In such circumstances 
the increase in accuracy with increasing size of sample will be more 
rapid than is indicated by the square-root law. Conversely, for samples 
of a given accuracy the advantage of stratification may be reduced by the 
fact that reduction in the size of the sample necessitates an increase 
in the size of the strata (Section 8.15). 

(6) If sampling units of type A consist of aggregates of sampling units of 
type B (e.g. households and individuals), the use of sampling units of 
type A in place of units of type B will usually result in lower accuracy 
for a given amount of material in the sample (Table 3.11.b and Example 
7.8.b). 

(7) If multi-stage sampling is used, more final-stage units will be required 
than will be the case with single-stage sampling of the final-stage units 
(Tables 3.7.b and 3.11.b). 

All the above rules are indicative only. The quantitative gains in accuracy 
or reduction in number of units required in any particular case must be evaluated 
by the methods described in Chapter 8. The final decision as to the type of 
sampling to be adopted necessarily depends on the relative accuracy of the 
various methods and their relative costs. 

It is advisable at the planning stage to consider as far as possible the form 
in which the results require to be presented. In more complicated surveys, 
particularly of the exploratory and research type, the results themselves will 
in part suggest the form in which they require to be presented, but in simple 
types of survey the form of presentation can often be laid down in considerable 
detail. This is a help in verifying that the sample is sufficiently large to cover 
the required domains of study adequately. 

4.33 Pilot and exploratory surveys 

From what has been said in the previous sections of this chapter it will 
be apparent that there are many points on which decisions can only properly 
be reached after preliminary investigations in the form of a pilot survey have 
been carried out. On material of which nothing is initially known, e.g. in 
surveys of undeveloped territory, a preliminary exploratory survey may be 
required before any proper pilot survey can be undertaken. In addition to 
providing general information, such an exploratory survey may be used to 
construct a first-stage frame. 

A pilot survey has two main objects : firstly, the provision of information 
on the various components of variability to which the material is subject, and 
secondly the development of field procedure, the testing of questionnaires and 
the training of investigators. A pilot survey may also provide data for the 
estimation of the various components of cost of the different operations involved 
in the survey, e.g. interview time, time of travel, etc. Knowledge of such 
costs is required not only as a basis for general estimates of cost, but also in 
order to determine what type and intensity of sampling will be most efficient. 

A further function of pilot surveys is to determine the most effective type 
and size of sampling unit. In a crop-cutting survey involving the harvesting 
of small areas, for instance, we may require to determine the best size and 
shape for these areas. In order to investigate the variability of different types 
and sizes of unit it is necessary to be able to form aggregates which represent 
the largest units which are of interest. Thus, if areas ranging from, say, 
-| ft. X 1 ft. to 3 ft. X 3 ft. are under consideration, it is necessary, or at least 
preferable, to harvest randomly distributed areas of 3 ft. X 3 ft. in sections 
of J ft. X 1 ft (Section 8.14). 

Pilot surveys will not normally be required for material on which there is 
considerable previous survey experience, since every survey provides information 
on the variability of the material surveyed, and this can often be used in the 
planning of further surveys. Thus the 1942 Census of Woodlands (Section 4.25) 
was planned on the basis of experience gained in the 1938-9 Census of 
Woodlands. Even in such cases, however, new questionnaires and new methods 
of observation and measurement should be tested on a more or less random 
sample of the material before being put into operation. 

The testing of the field procedure by means of a pilot survey is discussed 
in Chapter 5, and its planning needs no special comment. The planning of a 
pilot survey to provide relevant information on the various components of 
variation is rather more difficult. The finer points will be fully apparent after 
the estimation of efficiency has been discussed (Chapter 8), but a few fundamental 
points may be made here. 

At first sight it might be thought that a fully random sample would be 
a satisfactory form of sample for a pilot survey. This, however, is not necessarily 
the case. As a simple example we may consider the survey of material in which 
the use of a stratified sample is likely to be appropriate. In this case we shall 
require to determine the components of variation within strata. This can 
only be done from a fully random sample if the sample is sufficiently large 
relative to the number of strata for the majority of strata to contain at least 
two units. 
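
The difficulty can be illustrated by a small simulation. The following is a sketch only: the population of 40 equal strata and the function name are hypothetical and are not taken from the text. It draws a fully random sample and counts the strata that happen to receive at least two units, these being the only strata from which a within-stratum component of variation can be estimated.

```python
import random
from collections import Counter


def strata_with_two_or_more(stratum_sizes, n, seed=None):
    """Draw a random sample of n units without replacement from the whole
    population and return the number of strata containing at least two
    sampled units."""
    rng = random.Random(seed)
    # One label per unit: the index of the stratum to which it belongs.
    labels = [h for h, size in enumerate(stratum_sizes) for _ in range(size)]
    counts = Counter(rng.sample(labels, n))
    return sum(1 for c in counts.values() if c >= 2)


sizes = [50] * 40  # hypothetical population: 40 strata of 50 units each
for n in (40, 80, 200):
    usable = strata_with_two_or_more(sizes, n, seed=1)
    print(f"sample of {n}: {usable} of {len(sizes)} strata have two or more units")
```

With a sample no larger than the number of strata, many strata may be expected to contain fewer than two units, so that a small fully random pilot sample gives little information on variation within strata.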

This difficulty can be overcome by adopting some form of multi-stage 
sampling, so that the whole of the pilot sample is concentrated in a few of 
the strata. The primary stage of this multi-stage process need not necessarily 
be very rigorous. Thus in a survey covering a human population concentrated 
in villages, it may be considerably more convenient to use certain villages for 
the pilot survey rather than others. Provided that sufficient is known of these 
villages to indicate that they are fairly typical there is no serious objection to 
their use. Similarly in area sampling it may be sufficient to confine the pilot 
survey to districts conveniently situated with regard to the main and regional 
headquarters, provided there is some assurance that the different types of 
district are properly represented. 

Within the towns or areas selected for the pilot survey a fairly intensive 
survey can be made. If necessary a further stage may be introduced into the 
sampling process. Thus the survey may be confined to a selection of blocks 
in a city, instead of the whole city being sparsely covered. By this means, 
it is possible to obtain data which cover selected areas with a density which 
is of the same order as that which will be adopted in the final survey. If this 
is done the various possible types and sizes of strata can be effectively 
investigated. 

The concentration of the pilot sample into selected areas should not be 
pushed to extremes. It is better to have adequate cover of a representative 
sample of the data than highly detailed cover of small and possibly 
non-representative sections of it. Detailed cover will in any case not be required 
if the material is such that very small strata are known to be impracticable 
or of no value. 

When the possibilities of multi-stage sampling have to be investigated, 
the problem of designing a pilot survey is more difficult. The component 
of variability that governs the sampling error for the whole population is the 
variability of first-stage units (Section 7.17). Only if an adequate number of 
first-stage units are represented in the pilot sample will it be possible to make 
any reliable estimate of their variability. Moreover the first-stage units must 
be selected at random from within the first-stage strata that will be finally 
adopted. 

For this reason a much more extensive pilot survey is required for a 
multi-stage survey if any reliable preliminary estimate of the sampling error is to 
be obtained. In surveys in which comparatively few first-stage units are used, 
such as localized surveys on human populations (Section 4.18), the prior 
determination of the expected accuracy from pilot survey data is likely to 
prove to be impracticable. Reliance will then have to be placed on previous 
experience or on estimates of error from previously available data. This, 
however, is not quite so serious as it appears at first sight, since multi-stage 
samples with relatively few first-stage units are in general used most frequently 
in surveys of the research and investigational type, or in surveys which are 
repeated at intervals. In such cases the first few surveys can be used to provide 
data on the various components of variation, and the design can be modified 
if necessary in the light of this information. 

Even if the first-stage sampling error cannot be determined by means of 
a pilot survey, such a survey can be made to furnish reliable information on 
the errors to be expected at the second and subsequent stages. This will enable 
the survey to be planned so that the necessary accuracy is obtained on 
comparisons between first-stage units, which frequently form important domains 
of study in surveys of this type. 

Elaborate pilot surveys are not likely to be worth while in small-scale 
surveys. It is usually better to proceed with the actual survey work, even if the 
design adopted is not as efficient as would be possible if a full-scale pilot survey 
were first undertaken. If a series of small-scale surveys of similar type have 
to be undertaken the earlier surveys will themselves act as pilot surveys for the 
later surveys, the design of which can be modified in the light of the experience 
gained. 

CHAPTER 5 

PROBLEMS ARISING IN THE EXECUTION AND 
ANALYSIS OF A SURVEY 

5.1 Types of problem 

The problems arising in the execution of sample censuses and surveys are 
for the most part similar to those encountered in complete censuses. We 
shall therefore not discuss these problems in detail, but merely draw attention 
to some of the points which are of particular importance when sampling is 
used. 

The various phases of the work subsequent to the planning stage may be 
broadly classified as follows : 

(1) Setting up of the general administrative organization. 

(2) Design of forms. 

(3) Selection, training and supervision of the field investigators. 

(4) Control of the accuracy of the field work. 

(5) Arrangements for follow-up in the case of non-response. 

(6) Abstraction and coding of the information. 

(7) Statistical analysis. 

(8) Reporting. 

In certain types of survey no action may be required under some of these 
heads. In a survey conducted by postal questionnaire, for example, there will 
be no field investigators unless they are required to deal with cases of 
non-response. 

5.2 Administrative organization 

The administrative organization required will depend very much on the 
nature and scale of the census or survey, and on the area to be covered. 

The main field task for which an extensive administrative organization is 
required is the supervision of the investigators, or the carrying out of follow-up 
enquiries in cases in which there is no staff of investigators to undertake this 
work. The main administrative task at headquarters is the supervision of the 
computing and clerical staff engaged in the abstraction and analysis of the 
completed forms. 

Every opportunity should be taken to utilize existing administrative and 
office organizations. When the survey covers a large area, supervision from 
a central office is likely to be difficult and in such cases it is best to establish 
regional offices. Very frequently some existing organization can be used for 
this purpose. 

It is not necessary for the computing and clerical staff at headquarters to 
be administered by the same organization as the field staff. In many cases 
it is convenient to use some existing statistical organization to carry out the 
analysis, and to utilize some administrative organization with regional offices 
to supervise the field work. Often an organization can be employed which 
already has contact with the respondents, or which has on its staff individuals 
who are suitably qualified to act as field investigators. 

5.3 Design of forms 

Most careful attention should be given to the detailed design of the various 
forms that will be used in the course of the census or survey, especially the 
forms on which the observations and answers to questions are recorded. This 
applies also to the instructions and explanatory notes which accompany the 
forms. 

The content of the forms for the recording of the information is determined 
by the information that is required, and has already been discussed. They 
may be forms designed for completion by the recipients with little or no 
assistance, questionnaires which form the basis of interviews, or forms on 
which observations and measurements taken in the field are recorded by the 
field investigators. 

Each type of form presents its own difficulties of design. The simplest 
is that on which observations and physical measurements are recorded by the 
field investigators themselves. In this case, the chief points to observe are 
that the form is convenient to use, and that the results are set out in such a 
manner that they are convenient to abstract. Figures which have to be 
summed by the field investigators, for example, should be arranged vertically 
and not horizontally, as the investigators will not be using calculating machines. 

In surveys which involve observations and physical measurements it will 
almost always be necessary to supply field investigators with a separate set of 
instructions. Consequently there is no need for the form to carry its own full 
explanation, though it should of course be made as self-explanatory as possible. 
Experience has shown that instructions to field investigators should be very 
detailed, and should cover all possible points of uncertainty or ambiguity. 
Provision should also be made for revision and amendment as need arises, 
since it is extremely difficult to draw up a set of instructions which are completely 
unambiguous and deal with all possible contingencies. 

In forms of the census type, designed for completion by the recipients 
without assistance, very careful attention must be paid to the exact wording 
both of the questions and explanatory notes, so that there is no doubt in the 
mind of the recipient as to what is required. Detailed and lengthy explanations 
should be avoided as far as possible. Such explanations as have to be given 
should if possible appear in conjunction with the question to which they refer. 
The common practice of giving detailed explanatory notes on the back of a 
form is not very satisfactory, since it frequently results in the respondent 
filling in the whole or portions of the form without consulting these notes. 
Forms of this type should, if possible, carry a brief explanation of the reasons 
for the census. Even if this has been given in the press and elsewhere it is 
unlikely that all recipients will in fact have seen it. 

In forms of the questionnaire type designed for completion by field 
investigators the investigators must be instructed whether the questions are 
to be put in the exact form given, or whether they can be asked in a general 
form. As already stated, in most cases the general form is more suitable, but 
in questions on opinions, where different forms of wording may be expected 
to affect the answer, it may be necessary to adhere to an exact form. 

With the general form of question explanatory notes are often required 
in order to make clear to the investigators exactly what information is required. 
Such explanatory notes can either appear on the questionnaire itself or be given 
in a separate set of instructions. The latter course results in a much more 
compact form of questionnaire and is suitable when full-time investigators are 
used. The former course is more likely to ensure that all investigators are in 
fact aware of what is really required and is best when the investigators are 
carrying out the survey in the course of other duties. In a lengthy questionnaire 
this will necessitate the questionnaire being in the form of a booklet. Such 
questionnaires are more bulky and costly, and frequently entail more work 
in the coding of the results, but are nevertheless frequently preferable in these 
circumstances. 

Forms may be either printed or duplicated. Printing is much to be preferred 
as it results in much neater, clearer, and more compact forms. The ordinary 
type of duplicating paper is also not very suitable for writing on, particularly 
in ink. 

Small forms may be printed on cards instead of paper. Cards are often 
more convenient for field use, and in small surveys of which the results are 
analysed by hand the use of cards may save transcription before analysis. 
Alternatively Cope-Chat cards may be used (Section 5.10). 

Forms printed on paper may be made up in the form of blocks with 
cardboard backs. This facilitates writing in the field. Alternatively they can be 
clipped on to a wooden board. If duplicate copies of the completed forms 
are required, provision should be made for carbon copies to be taken at the 
time the forms are filled in. 

Forms larger than foolscap should be avoided if possible. They are 
troublesome both to handle and to store. Forms of more than one sheet should 
also be avoided. It is usually better to use both sides of a sheet or card than 
to use two sheets. 

Forms should always be subjected to a preliminary trial in the field. Only 
in this way will minor faults be discovered. In the case of questionnaires this 
test is best arranged in two parts : 

(a) a trial by investigators who are fully experienced in questionnaire 
work, and who are conversant with the problems under investigation ; 

(b) a trial by investigators of the type that are to be employed in the 
survey. 

The first trial will serve to determine whether the questionnaire is in the 
form most suitable for eliciting the required information from the respondents, 
and the second trial will provide information on whether the questions and 
associated instructions are understood by and within the capability of the 
investigators. 

5.4 Special tests of questionnaires and investigators 

In certain cases it may be worth making rigorous tests of different forms 
of the same question to see whether there are any material differences in the 
answers received. Since the question cannot be put in both forms to the same 
respondent, this must be done by the use of interpenetrating samples, using 
the same investigator or investigators for both forms. In order to eliminate 
the effect of any progressive change in the investigators or the respondents the 
tests of the two forms should proceed simultaneously. In the same way the 
difference between two or more investigators using the same form of question 
can be tested. 

More elaborate and precise tests of differences resulting from different 
forms of the same question and different investigators can be carried out by 
using the methods developed in the design of experiments. Thus if two forms 
P and Q of a question and three investigators A, B and C require to be tested, 
groups or blocks of six respondents may be used. The blocks should be chosen 
in such a manner that the respondents within each block are as alike as possible, 
using any available prior information. The six question-investigator 
combinations PA, QA, PB, QB, PC, QC are then assigned at random to the 
respondents of each block. This design is technically known as a 2 x 3 
factorial design in randomized blocks. By this device differences between forms 
of question and between investigators are simultaneously tested. Information 
is also obtained on what are known as the interactions between forms of question 
and investigators, i.e. on whether the differences between the forms of question 
are different for the different investigators, and vice versa. The grouping of 
respondents into blocks ensures that errors due to differences between 
respondents are eliminated as far as possible ; the randomization enables the 
standard errors of the comparisons to be calculated by the methods appropriate 
to the analysis of replicated experiments (see for example Fisher's Design of 
Experiments or Snedecor's Statistical Methods).* 
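
The random assignment within blocks may be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the respondent labels and the use of a seed are all hypothetical, and the formation of the blocks themselves (the grouping of similar respondents) is taken as already done.

```python
import itertools
import random


def assign_factorial_blocks(blocks, forms=("P", "Q"),
                            investigators=("A", "B", "C"), seed=None):
    """Assign every form-investigator combination once, in random order,
    to the respondents of each block (a 2 x 3 factorial in randomized
    blocks when there are two forms and three investigators)."""
    rng = random.Random(seed)
    combos = [f + i for f, i in itertools.product(forms, investigators)]
    plan = {}
    for block_id, respondents in blocks.items():
        if len(respondents) != len(combos):
            raise ValueError(
                f"block {block_id!r} needs exactly {len(combos)} respondents")
        order = combos[:]   # fresh copy, so each block is randomized separately
        rng.shuffle(order)
        plan[block_id] = dict(zip(respondents, order))
    return plan


# Hypothetical blocks of six respondents each.
blocks = {
    "block 1": ["r1", "r2", "r3", "r4", "r5", "r6"],
    "block 2": ["r7", "r8", "r9", "r10", "r11", "r12"],
}
for block_id, assignment in assign_factorial_blocks(blocks, seed=7).items():
    print(block_id, assignment)
```

Each block receives all six combinations exactly once, so that comparisons between forms and between investigators are made within blocks of similar respondents.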

Investigations of this kind can be carried out in the course of an actual 
survey, but they are normally better undertaken as a special investigation or 
as part of the pilot survey, since information on different forms of question 
will be required at the planning stage, and it is usually inadvisable to complicate 
the field procedure of a large survey. Routine tests of differences between 
different investigators may, however, be incorporated without undue 
complication in the actual survey by means of interpenetrating samples. 

5.5 Selection, training and supervision of field investigators 

Field investigators may be specially appointed, they may be members of 
existing staffs appointed for other work but over whom authority can be 

* An interesting investigation of the differences between three groups of interviewers 
(two belonging to professional organisations, one of university students) has recently been 
carried out by the Department of Research Techniques, London School of Economics 
(Durbin and Stuart, 1951, D'; Booker and David, 1952, D'). A factorial design was 
employed. 
exercised, or they may be individuals asked to undertake the work on a voluntary 
basis or for a small honorarium. 

The problem of selection arises primarily in the case of investigators 
appointed specially for the work. In order to secure a suitable type of person 
preliminary tests should if possible be made of all applicants, and the early 
work of newly appointed investigators should be carefully watched and 
supervised. In large-scale censuses and surveys, proper training courses should 
be arranged. If a pilot survey is undertaken this provides a valuable 
opportunity for training, and every attempt should be made to build up the 
team of investigators at this stage rather than later, even if this involves a 
certain amount of additional expense. 

It is of the greatest importance that investigators, once they have been 
trained and are found suitable, remain in the job. Every effort must therefore 
be made to see that the pay is adequate, and that the work is made as attractive 
as possible. In the case of the interview type of survey, investigators are 
sometimes paid on piece rates at so much a completed questionnaire. This 
is in general unsatisfactory, since it tends to lead to skimped work and to 
irregularities such as substitution of one respondent for another. 

It should not be forgotten that field work of the interview type is very 
arduous and is found by almost all investigators to involve considerable mental 
strain. Hours of work are also likely to be irregular, since if excessive 
non-response is to be avoided some evening interviews are almost inevitable. 
Investigators should therefore not be expected to work excessively long hours, 
and should if possible be given a rest on other work from time to time. It is 
often advantageous to bring full-time investigators to headquarters at intervals 
and use them for office work such as abstraction and analysis of the results. 
This not only serves to provide a break from field work, but also enables them 
to gain a much better insight into the purposes of their work. 

Whatever the conditions of work and form of payment, there must be 
adequate field supervision. The supervisors should themselves undertake 
field work from time to time, so that they are in a position to appreciate the 
difficulties of the work, and should also contact the workers while they are 
actually in the field. Provision should be made for personal contacts not only 
between supervisors and the field investigators, but also between supervisors 
and the headquarters staff. In long-term surveys it is also often advantageous 
to arrange conferences of the investigators from time to time at which difficulties 
can be discussed and the whole progress of the survey reviewed. 

5.6 Control of the accuracy of the field work 

The best assurance that the field work shall be accurate is that the 
investigators are thoroughly trained in their work, and are capable, conscientious, 
and keen. Nevertheless it is important even with the best investigators to 
keep a close watch on the progress of the work. 

In certain cases, particularly in surveys involving observations and physical 
measurements, it is possible to arrange a system of field checks by the 
supervisors. These should preferably be carried out on a random sub-sample of 
units, and should in any case be conducted in such a manner that the investigators 
cannot know which parts of their work will be checked. Checks of this type 
will not usually be possible in the interview type of survey, as it is clearly 
impracticable to ask for the same information twice from the same individual. 

A preliminary examination of the completed forms must be made as soon 
as possible after they are completed. In this way defective work, in so far as 
it reveals itself in the forms themselves, is brought to light immediately, and 
remedial action can be taken. If the census or survey is such that a large 
volume of work is turned in by each field investigator, and it is not considered 
necessary to give individual scrutiny to all the returns, a proper sample of the 
work of each investigator should be scrutinized as a routine matter. 

The investigators should themselves be instructed to carry out any simple 
numerical calculations that are required on the forms, and also to look through 
the forms before sending them in to see if they are in satisfactory order. On 
the other hand extensive revision of the forms should not be permitted. The 
preparation of fair copies by the investigators is in general undesirable, since 
it leads to copying errors and also makes any judgment on the quality of the 
work more difficult. If fair copies are permitted the originals should be returned 
together with the fair copies, and a certain percentage at least should be checked 
for copying errors and other changes. 

If the questionnaire is such that the investigator has to furnish or amplify 
the answers to some of the questions from notes taken at the interview this 
should be done immediately after the interview rather than at the end of the 
day, even if this course is somewhat inconvenient. In general, however, it 
is best for the information to be written down in its final form at the interview, 
any supplementary observations by the investigator being given under a separate 
heading. 

If comparisons between the different investigators by means of inter- 
penetrating samples have been arranged, the comparative results must be 
made available as quickly as possible, in order that effective action may be 
taken if discrepancies are discovered. On the other hand the use of 
interpenetrating samples should not be made an excuse for the relaxation of 
other forms of control. Interpenetrating samples are not likely to reveal minor 
defects in an individual investigator, and they will certainly not reveal faults 
which are common to all investigators. They should therefore be regarded 
as a check against major defects in individual investigators rather than as a 
complete control of all investigators. 

5.7 Arrangements for follow-up in the case of non-response 

The follow-up arrangements will naturally vary very greatly according to 
the type of census or survey. 

In the case of a postal questionnaire they will often involve an entirely 
different organization from that which is employed to carry out the survey 
itself. Since postal follow-ups from headquarters are of limited utility, some 
form of local organization which can deal with non-respondents by telephone 
and personal visit is required. 

In surveys using field investigators careful instructions must be issued 
in order to be sure the follow-up arrangements are properly carried out. 
Specific warnings should be given against such practices as substitution of 
neighbouring households when there is no response. If the follow-up is to 
be made on a sub-sample only of the non-respondents exact instructions for 
taking this sub-sample must be given, so that it can be obtained as soon as 
the non-respondents are known. In general some very simple sampling method, 
such as taking of every qth non-respondent, is adequate. Such a procedure 
has the advantage that a list for follow-up can be prepared at the time of the 
original non-responses, if necessary by the field investigator concerned. 
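
The rule of taking every qth non-respondent can be sketched in Python as follows. This is an illustration only: the function name, the household labels and the use of a fixed starting position are assumptions made for the example and do not come from the text.

```python
def every_qth(non_respondents, q, start=0):
    """Return every q-th item, beginning at index `start` (0 <= start < q)."""
    if q < 1 or not 0 <= start < q:
        raise ValueError("need q >= 1 and 0 <= start < q")
    return non_respondents[start::q]


# Hypothetical list of non-responding households, in order of occurrence.
missed = ["H03", "H07", "H12", "H15", "H21", "H22", "H30"]
follow_up = every_qth(missed, q=3, start=1)
print(follow_up)
```

Since the rule depends only on the order in which the non-responses occur, the list can be extended as the work proceeds without disturbing the cases already selected.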

5.8 Statistical analysis 

Statistical analyses of the results of complete censuses and surveys are 
mostly based on counts of numbers of units falling in different classes and 
sub-classes, and on the totals for these classes of recorded quantitative variates. 
The units to which these counts and totals refer may be either the sampling 
units or some other natural units. In certain cases totals are also required for 
ratios or other quantities calculated from the values recorded for the individual 
units. 

From these numbers and totals the means can be calculated for the different 
classes. Basic summary tables can then be prepared. In these summary 
tables frequencies based on counts are often expressed in percentages, the 
bases of the percentages being chosen so as to exhibit the differences in 
proportions which are of interest. When the summary tables have been 
prepared, more critical statistical analysis of these tables may be required in 
order to isolate the effects of the various factors which are believed to influence 
the results. 

The treatment of the results of sample censuses and surveys is similar 
in most respects to that of complete censuses and surveys. If, however, the 
sampling fractions are different for the different units, the appropriate weights 
have to be applied at some stage of the calculations. The utilization of 
supplementary information will also necessitate adjustment of the basic totals 
and means. The appropriate formulae for these operations, which differ 
according to the type of sampling adopted, are given in Chapter 6. 
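
The weighting step may be illustrated by a small Python sketch. It assumes the simplest case, in which each class has a single sampling fraction and the estimate of the population total is obtained by multiplying each class total by the reciprocal of its fraction. The class names and figures are invented for the illustration; the formulae appropriate to each type of sampling are those of Chapter 6.

```python
from fractions import Fraction

# Hypothetical sample totals and sampling fractions for three size-groups.
sample_totals = {"small": 120, "medium": 340, "large": 910}
sampling_fraction = {
    "small": Fraction(1, 20),
    "medium": Fraction(1, 4),
    "large": Fraction(1, 1),
}


def raised_total(totals, fractions):
    """Sum of class totals, each multiplied by its raising factor 1/f."""
    return sum(totals[c] / fractions[c] for c in totals)


estimate = raised_total(sample_totals, sampling_fraction)
print(estimate)
```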

In sample censuses and surveys we shall normally require estimates of the 
sampling errors. In addition, investigations of the relative efficiency of 
different sampling methods may be undertaken in order to improve the efficiency 
of future surveys on the same or similar material. 

The various stages in the computations may therefore be classified as 
follows : 

(1) Preliminary computations on the values of the individual returns, 
such as the calculation of ratios and the introduction of individual weighting 
factors. If punched cards are used, some or all of this work may be carried 
out mechanically, subsequent to the punching of the cards. 

(2) Abstraction and coding of the results so that they are in a form suitable 
for analysis or for transfer to punched cards. 

(3) Punching (in cases where punched cards are used). 

(4) Counts and totals. 

(5) Preparation of the summary tables from these counts and totals, 
including adjustments for supplementary information and any weighting not 
already carried out. 

(6) Calculation of sampling errors and investigations of efficiency. 

(7) Critical analysis. 

Apart from the calculation of sampling errors and investigations of efficiency, 
which are described in Chapters 7 and 8, we do not propose to discuss these 
operations in detail in this book. In the following sections we will merely 
give an outline of the special points that arise at the various stages. 

5.9 Methods of handling the data 

There are four main ways in which the data accumulated in the course 
of a census or survey may be handled. These are : 

(1) An analysis direct from the forms. 

(2) Transference of the data to ordinary cards. 

(3) The use of cards with holes round the edges (Cope-Chat cards). 

(4) The use of Hollerith or Powers-Samas cards (punched cards). 

The primary function of any type of card is to enable the data to be sorted 
into different classes, so that the numbers of units and totals associated with 
these classes can be obtained without transcription. With plain cards the 
sorting has to be done entirely by hand; with Cope-Chat cards marginal 
punching gives some aid to the hand sorting process, while with punched 
cards the sorting is carried out mechanically, and the counts and totals are 
also obtained mechanically. 

In certain cases the data can be recorded directly on cards which are 
subsequently used in the analysis. These may be either ordinary cards or 
Cope-Chat cards. The use of cards in this manner is limited by the fact that 
the amount of uncoded information that can be conveniently recorded on a 
card is small, and also by the fact that cards tend to be damaged by use in 
the field. 

It is also possible to record information directly on Hollerith cards, either 
in a form which enables it to be read by the punch operator as the card is 
punched, or in a form that enables it to be punched automatically by the process 
known as mark sensing. The occasions on which either of these methods has 
any real advantage over the punching of cards from ordinary forms are 
somewhat rare in census and survey work.* 

If only a single classification is required, the preparation of a summary 
directly from the forms is likely to be the most economical method of procedure. 

* But see Section 10.1. 

If more than one classification is required, the use of forms may still be 
reasonably economical, particularly for small surveys, but the possibilities of 
using forms in this manner are limited by the fact that paper forms are not 
easily sorted or counted, and will not stand a great deal of handling.* The 
direct use of forms is also sometimes of value when a rapid preliminary summary 
of the salient features of a survey is required. In a large survey such summaries 
can usually be based on a small sub-sample of the forms. 

If the data are transferred to cards, some form of compression and coding 
is usually necessary. This enables the information to be recorded in compact 
form on the card, and also facilitates the subsequent counts and summation. 
If Cope-Chat cards are used, all information recorded in punched form must 
be coded, and with punched cards the whole of the information has to be 
coded in numerical (or exceptionally alphabetical) form. 

The fact that punched cards have all their information coded in numerical 
form has the disadvantage that the detailed information relating to separate 
units cannot be easily studied by means of the cards themselves. It is also 
difficult to record written remarks on the cards. This tends to make the 
analysis more mechanical. Punched cards are therefore unsuitable for analyses 
which require detailed examination of the whole complex of information 
relating to individual units. Even in surveys which are so large that analysis 
by means of punched cards is essential it is often advisable to arrange that the 
original forms are kept available, so that in any detailed investigational work 
the forms corresponding to selected cards can be extracted and examined 
when required. 

5.10 Cope-Chat cards 

Cope-Chat cards are cards which have a row of holes along each edge. 
A group of these holes can be assigned to each particular classification, e.g. 
the answers to a specific question, each hole being taken to represent one 
class in this classification. The body of the card (front and back) can be used 
for recording written information. 

By means of a punch similar to an ordinary ticket punch, V-shaped notches 
can be cut out of the card so as to obliterate any desired holes. If the cards 
are arranged in a pack and a knitting needle is passed through a particular hole, 
the cards punched in this hole will fall from the pack when the pack is lifted 
by means of the needle and thoroughly shaken. This enables cards to be sorted 
into different classes with considerably greater speed than would be the case 
if the information were merely recorded on plain cards, and the sorting had to 
be carried out by examination of each card. The Cope-Chat method of sorting 
is not fully reliable, since cards do not always fall out of the pack when it is 
shaken, but mis-sorts can be detected by visual inspection of the edges of 
the retained cards. If all classes of a given classification are coded in some 
mutually exclusive system a positive check will be available. 
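
The needle sort can be pictured with the following Python sketch, in which each card is represented by the set of hole positions that have been notched. The card contents and hole numbers are hypothetical, the sort is assumed to be ideal (every notched card falls), and the final check is only one possible reading of the positive check mentioned above.

```python
def needle_sort(pack, hole):
    """Split a pack into (fallen, retained) for a needle through `hole`.

    A card whose hole has been notched open drops off the needle;
    a card with the hole intact stays on it.
    """
    fallen = [card for card in pack if hole in card["notched"]]
    retained = [card for card in pack if hole not in card["notched"]]
    return fallen, retained


# Hypothetical pack: holes 1-3 code three mutually exclusive classes.
pack = [
    {"id": "a", "notched": {1}},
    {"id": "b", "notched": {3}},
    {"id": "c", "notched": {1}},
    {"id": "d", "notched": {2}},
]

fallen, retained = needle_sort(pack, hole=1)
print([c["id"] for c in fallen], [c["id"] for c in retained])

# With a mutually exclusive code every card should be notched in exactly
# one hole of the group, so a card notched in none or in several is suspect.
suspect = [c["id"] for c in pack if len(c["notched"] & {1, 2, 3}) != 1]
print(suspect)
```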

* In making counts or calculating totals from forms it is usually best to sort the 
forms into the necessary classes, 

The amount of information that can be recorded on the edge of the cards 
is limited, since the number of holes is limited by the size of the card. In 
the Survey of Fertilizer Practice, for example, 5 in. × 8 in. cards are used 
with a hole spacing of just over 4 to the inch, giving 105 holes in all.* 

The punching of Cope-Chat cards is somewhat laborious, and if a mistake 
is made a new card has to be prepared. In this case the written information 
will have to be transferred to the new card. For this reason it is usual to mark 
the holes which have to be punched, and check the markings before actually 
punching. If a considerable amount of punching is to be done, a form of gang 
punch can be used which will punch a particular hole from a number of cards 
at one operation. In this case the cards can be sorted into the appropriate 
classes and the sorting checked before punching. A key-operated punch is 
also available. 

When the cards have been sorted they require to be counted by hand. 
If totals of numerical information are required, the summations must be 
performed on an ordinary adding or calculating machine, unless the numerical 
information has itself been coded. The counting of cards is a tedious operation, 
and is made more so by the punching round the edges. For some purposes 
it may be feasible to replace exact counts either by weighing or by measuring 
the aggregate thickness under a definite pressure. Neither method is very 
accurate, however. In a humid climate, for example, the weights tend to vary 
considerably owing to changes in moisture content. 

The use of Cope-Chat cards enables isolated cards having given 
characteristics to be much more readily extracted than is the case when plain 
cards are used. Cope-Chat cards are therefore of value for surveys in which 
units of particular types require to be identified subsequently. In this respect 
they have certain advantages over punched cards, since no elaborate sorting 
mechanism is required and the information concerning the selected units is 
presented in written form. 

Cope-Chat cards also have the minor advantage that the proportions falling 
in different classes can be roughly observed by sorting the cards and then 
examining the distribution of the notches. 

* In both the Cope-Chat and punched card systems one corner is cut across 
diagonally on all cards so as to provide a check that all cards are right way round 
in the pack. 

The coding of numerical information on Cope-Chat cards can be carried 
out in a number of ways. If approximate values only are required the data 
may be grouped into size-groups. If exact values are required the simplest 
method is to allocate ten holes to each digit of the number, but this can only 
be done if very little numerical information has to be coded, owing to the 
limited capacity of the card. An alternative is to use some form of two-hole 
code to represent each digit. The most compact is that based on four holes, 
which are taken to denote the digits 1, 2, 4, 7. To code other digits the two 
digits whose sum gives the required digit are punched. Thus the punching of 
1 and 2 represents digit 3. This system is not self-checking on sorts, since 
two, one or no holes may be punched. If a fifth hole carrying the value 0 is 
added, every digit can be denoted by a pair of holes, with the convention that 
4, 7 denotes 0. An alternative with five holes, which is simpler, but not 
self-checking on sorts, is to use the holes to denote the digits 1, 2, 3, 4, 5, 
digits over 5 being indicated by double punching. 
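
The five-hole system in which every digit is denoted by a pair of holes can be set out in a short Python sketch. The hole values 0, 1, 2, 4, 7 and the convention that 4, 7 denotes 0 are taken from the description above; the function names and the treatment of a mis-punch as an error are assumptions of the example.

```python
from itertools import combinations

HOLES = (0, 1, 2, 4, 7)  # values carried by the five holes


def build_code():
    """Map each digit 0-9 to the pair of holes that represents it."""
    code = {}
    for pair in combinations(HOLES, 2):
        total = sum(pair)
        # Convention: the pair 4, 7 (sum 11) denotes the digit 0.
        code[0 if total == 11 else total] = pair
    return code


CODE = build_code()


def decode(punched):
    """Digit denoted by a set of punched holes; anything other than
    exactly two valid holes is reported as a mis-punch."""
    punched = set(punched)
    if len(punched) != 2 or not punched <= set(HOLES):
        raise ValueError("mis-punch: exactly two of the five holes expected")
    total = sum(punched)
    return 0 if total == 11 else total


for digit in range(10):
    print(digit, CODE[digit])
```

The ten possible pairs of five holes correspond exactly to the ten digits, which is why a card showing any number of punched holes other than two can be recognized as faulty.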

5.11 Punched cards 

Two different systems are available, known as the Hollerith and Powers- 
Samas. Both systems employ cards in which each column has 12 positions, 
in any one of which a hole may be punched. When required, two or more 
holes may be punched in different positions in the same column. In both 
systems alphabetical information can be dealt with by means of a two-hole 
code.* 

Hollerith installations employ 80- or 38-column cards. Powers installations 
employ 65-, 36- or 21-column cards. A given installation will only handle 
cards of one size. By using each column for two items of information, with 
a special form of multiple punching, the numerical capacity of cards of both 
systems can be doubled. 

The actual punching of the cards is normally done by a hand-operated 
key punch. Verification, which checks within certain limitations that the 
original punching is correct, is normally performed by means of a hand- 
operated verifier similar in construction to a punch. More elaborate punches 
of various kinds are also available. 

The main difference between the Hollerith and Powers systems is that in 
the Hollerith system the cards are read electrically, whereas in the Powers 
system they are read mechanically. This results in a greater flexibility in the 
Hollerith system, since the machines can be set up for any required operation 
by means of electric connections through one or more plug-boards. If the 
analysis is confined to sorting and counting, the two systems, apart from card 
capacity, have almost identical performance. For the more elaborate types 
of analysis, Hollerith equipment is more suitable than Powers equipment, 
particularly in surveys of moderate size where many different types of machine 
operation, which often cannot be planned in advance, are required on relatively 
small batches of cards. We shall here confine ourselves to a description of the 
Hollerith machines, but it should be emphasized that if Hollerith equipment 
is not available it may be preferable to utilize existing Powers equipment rather 
than send the work elsewhere or use methods not involving punched cards. 

The principal Hollerith machines are : 

(1) The sorter, 

(2) The sorter-counter, 

(3) The tabulator, 

(4) The reproducing summary punch, 

(5) The multiplying punch, 

(6) The collator. 

* See Section 10.1 for notes on some recent developments. 

The descriptions which follow are not intended to give a complete account 
of these machines and their various modifications, but only an indication of 
the way in which they work and the simpler types of operation that can be 
undertaken with them. An expert should always be consulted when planning 
any extensive punched-card work. Initial consultations should take place 
before the coding of the material is undertaken. 

5.12 The sorter and sorter-counter 

The sorter can be set to operate on any one column of the card. When 
the cards are passed through the machine they are separated into 12 boxes 
corresponding to the 12 positions of the holes punched in this column, with 
an additional box for cards with no hole in the column. If, therefore, a code 
representing some classification of the material into anything up to 12 classes 
is punched in the column, the cards corresponding to the different classes 
will be sorted into the different boxes. A classification with more than 12 and 
up to 144 classes can be coded on two columns, and by sorting successively 
on each of the two columns separation into all the classes can be effected. 
In the same way, if a group of columns or field is used to denote a number, 
the cards can be arranged in numerical order by sorting first on the units, then 
on the tens, and so on. Equally, if two columns represent two different 
classifications the cards can be sorted into the various cells of the two-way 
classification so formed. If two holes are punched in the same column the card 
is sorted to the higher digit, unless sorting on this digit is suppressed. 

The sorter is normally used for arranging the cards of the pack into groups 
or into a given order prior to their passage through the tabulator. When this 
is done the whole of the cards are kept in one pack, i.e. at the end of each sort 
the cards are collected from the separate boxes and the sub-packs are placed 
together in numerical order. 
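
The procedure of sorting first on the units column, then on the tens, and so on, with the sub-packs reassembled in order after each pass, corresponds to what is now usually called a least-significant-digit radix sort. The Python sketch below assumes that each card is represented by a string of digits, one character per column, and uses ten pockets only (the additional pockets of the sorter are ignored); the names are illustrative.

```python
def sort_on_column(pack, column):
    """One pass of the sorter: drop cards into pockets 0-9 by the digit
    in `column`, then reassemble the pockets in numerical order."""
    pockets = [[] for _ in range(10)]
    for card in pack:
        pockets[int(card[column])].append(card)  # order within a pocket is kept
    return [card for pocket in pockets for card in pocket]


def sort_on_field(pack, first_col, last_col):
    """Sort on a multi-column field: units column first, then tens, etc."""
    for column in range(last_col, first_col - 1, -1):
        pack = sort_on_column(pack, column)
    return pack


# Hypothetical cards with a three-column numeric field in columns 0-2.
pack = ["307", "012", "150", "007", "312", "150"]
in_order = sort_on_field(pack, 0, 2)
print(in_order)
```

The method works only because each pass preserves the order already established among cards falling into the same pocket, which is the reason for keeping the sub-packs in order when they are collected.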

If only counts are required it is possible to obtain these directly on a sorter 
with a counter device which registers the numbers of holes occupying the 
various positions in the given column. A machine with this device is called 
a sorter-counter. Sorting can be suspended during counting if desired. 

The ordinary sorter-counter counts on a single column only and does not 
print the results. For large-scale census work more elaborate types of sorter- 
counter are available which will count simultaneously on a number of columns, 
printing the results obtained in these counts. 

5.13 The tabulator 

The tabulator is a much more elaborate machine than the sorter. Its 
primary function is to add numbers punched in a given field from a group of 
cards. To effect this the numbers are read successively as the cards pass through 
the machine, being added on one of a set of counters which form part of the 
machine. The machine has a printing device which will print the totals 
accumulated in the counters, and will also, if desired, print numbers read from 
the cards. The operation of obtaining and printing the totals is called 
tabulation, and that of printing numbers read from the cards is called listing. 

Most tabulators have a number of counters and print banks ; the totals 
of several fields can therefore be accumulated simultaneously. If the numbers 
in the fields concerned are sufficiently small, two or more fields can be 
accumulated in different parts of the same counter, thereby further increasing 
the capacity of the machine. 

In order to enable the totals of the groups of cards in different classes to 
be obtained successively without stopping the machine and without having 
to feed in the groups of cards separately, a device known as the control is 
incorporated in the machine. This device is such that when wired to control 
on a given column, the machine will break control if the card following the 
one that is being added carries a different designation on the control column. 
This break of control stops the adding process and gives the machine certain 
instructions as to printing and clearing, e.g. it can be wired so that the total 
already obtained is printed and cleared before passing on to the next group 
of cards. Thus, if a pack of cards sorted into groups corresponding to the 
code on a single column is passed through the tabulator with the control wired 
to that column, the machine will break control at the end of each group and 
the group totals of any desired field can thereby be obtained. 

The control can be arranged to operate on a number of columns, and 
different stages of the control can be associated with the different columns. 
Different instructions can be given to the clearing and printing mechanisms 
according to which stage of the control is operating. Thus, for example, it 
is possible to obtain totals of main and sub-groups simultaneously by feeding 
the numbers from the given field into two different counters, one of which 
is cleared at the end of each sub-group and the other at the end of each main 
group. 
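
The action of the control on two columns can be imitated in a few lines of Python. The sketch assumes that each card carries a main-group code, a sub-group code and one numerical field, that the pack is already sorted on both codes, and that the end of the pack is treated as a break of control at both stages; the record layout is hypothetical.

```python
def tabulate(cards):
    """Accumulate a field in two counters, printing and clearing on breaks.

    `cards` is a list of (main, sub, value) tuples sorted on (main, sub).
    """
    lines = []
    sub_counter = main_counter = 0
    for i, (main, sub, value) in enumerate(cards):
        sub_counter += value
        main_counter += value
        following = cards[i + 1] if i + 1 < len(cards) else None
        # Minor break: the following card differs in either control column.
        if following is None or (following[0], following[1]) != (main, sub):
            lines.append(("sub", main, sub, sub_counter))
            sub_counter = 0  # cleared at the end of each sub-group
        # Major break: the following card differs in the main column.
        if following is None or following[0] != main:
            lines.append(("main", main, None, main_counter))
            main_counter = 0  # cleared at the end of each main group
    return lines


cards = [(1, 1, 5), (1, 1, 7), (1, 2, 4), (2, 1, 3), (2, 1, 6)]
printed = tabulate(cards)
for line in printed:
    print(line)
```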

Counts can be carried out on the tabulator, either in conjunction with a 
tabulation or independently, by what is known as the card count. This feeds 
1 into any desired counter at the passage of each card. The control and printing 
mechanisms operate as before. 

The more elaborate forms of tabulator have a number of auxiliary devices 
which considerably increase their potentialities and flexibility. The two most 
important in the British machines are the rolling feature and distributors. In 
the rolling total tabulator, numbers can be transferred or rolled from one counter 
to another, either positively or negatively, according to instructions issued by 
the control mechanism. Distributors enable numbers read from a field to be 
directed to different counters, and also enable numbers taken from one counter 
to be directed to different counters in rolling, or to different print banks. 
When used in the first manner the distributors operate on instructions read 
from some other column of the card. 

A single distributor, for example, enables positive and negative numbers 
in the same field to be distributed into two counters according to their sign 
(punched in code in another column). By rolling the total of the negative 
counter negatively into the positive counter at the end of the group the correct 
total (or its complement if negative) is obtained. A further device replaces 
the complement by the negative total on printing. 

By using four distributors it is possible to make counts of classes represented 
by a code in a single column without sorting on that column. In this case 
the card count is fed through the distributors which are controlled by the 
punching of the column in question. The wiring is so arranged that the card 
count is directed to a different counter or part of a counter for each of the 
12 possible code punchings. 

The use of a tabulator in place of a sorter-counter for counting, either with 
or without distributors, has the advantage that the results are obtained in 
printed form. It also enables the whole pack of cards to be kept together, 
since the different main classes are automatically separated by means of the 
control. 

It should be noted, however, that when there is multiple punching in the 
column being counted, the tabulator cannot be used for counting if only four 
distributors are available. A sorter-counter counts all holes punched in a 
column, e.g. if 4 and 8 are punched both the 4 and the 8 will be counted. 

Rolling can be used to carry out simple multiplications. In the National 
Farm Survey analysis, for example, a variable sampling fraction with values 
1/20, 1/10, 1/4, 1/2, 1/1, was used for the different size-groups. Multiplication 
of the size-group totals by their appropriate raising factors was effected as 
follows : 

(a) For multiplication by 2 the total was rolled into itself. 

(b) For multiplication by 4 operation (a) was repeated. 

(c) For multiplication by 10 the total was rolled through a distributor so 
as to give transposition by one place. 

(d) For multiplication by 20 operations (a) and (c) were combined. 
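
Operations (a) to (d) can be checked with a short Python sketch which uses only addition and a shift of the decimal digits, these being the two operations available in rolling. The function names are invented for the example, and the totals are assumed to be non-negative whole numbers.

```python
def roll_into_self(total):
    """Operation (a): adding a counter to itself doubles it."""
    return total + total


def transpose_one_place(total):
    """Operation (c): moving every digit one place to the left multiplies
    by 10. Done on the decimal digits so that no multiplication is used."""
    return int(str(total) + "0")


def raise_total(total, factor):
    """Apply a raising factor of 1, 2, 4, 10 or 20 by rolling only."""
    if factor == 1:
        return total
    if factor == 2:
        return roll_into_self(total)
    if factor == 4:
        return roll_into_self(roll_into_self(total))
    if factor == 10:
        return transpose_one_place(total)
    if factor == 20:
        return transpose_one_place(roll_into_self(total))
    raise ValueError("unsupported raising factor")


for factor in (1, 2, 4, 10, 20):
    print(factor, raise_total(137, factor))
```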

Sums of squares and products, which are required for the estimation of 
sampling errors, regression coefficients, etc., can be obtained on a tabulator 
by what is known as progressive digiting. Suppose the sum of the products of 
two sets of numbers A and B is required, and that all the A's are single digit 
numbers. The cards are sorted on A and then fed through the tabulator, 
9's first, the B's being summed and the progressive total printed (without 
clearing) at each change of digit of A. The sum of these progressive totals 
(excluding the final or 0 total) gives the sum of the products of A and B. 
On a rolling total tabulator this summation can be carried out on the tabulator, 
provided steps are taken by the insertion of extra cards to see that every digit 
is represented. If the A's contain more than one digit each digit is treated 
separately, with multiplication by 10, etc., before the results are combined. 
The whole of this operation can be effected in full on the larger rolling total 
tabulators. Subject to limitations of capacity, sums of products of A with 
itself and several other numbers B, C, D, etc., can be carried out simultaneously 
without additional sorting or tabulation. 
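
Progressive digiting for single-digit values of A can be verified by the following Python sketch, which uses additions only and compares the result with the sum of products computed directly. The function name and the data are hypothetical; stepping through every digit from 9 down to 1, whether or not any card carries it, plays the part of the extra cards inserted to ensure that every digit is represented.

```python
def progressive_digiting(cards):
    """Sum of products A*B for single-digit A, using additions only.

    `cards` is a list of (a, b) pairs with 0 <= a <= 9.
    """
    # Total of B for each digit of A (the effect of sorting on A).
    b_total = [0] * 10
    for a, b in cards:
        b_total[a] += b

    running = 0             # counter that is never cleared
    sum_of_progressive = 0  # sum of the printed progressive totals
    for digit in range(9, 0, -1):   # 9's first; the 0 group is excluded
        running += b_total[digit]   # progressive total at this change of digit
        sum_of_progressive += running
    return sum_of_progressive


cards = [(3, 10), (9, 2), (0, 50), (3, 4), (7, 1)]
print(progressive_digiting(cards), sum(a * b for a, b in cards))
```

The method succeeds because the B total for the cards carrying digit d enters into exactly d of the progressive totals, namely those printed at digits d, d - 1, ..., 1. Sums of squares are obtained in the same way by taking B equal to A.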

The American tabulators do not have the rolling feature, but are capable 
of carrying out direct addition or subtraction according to card designation. 
In contrast to the rolling total tabulator, the counting wheels may be grouped 
in counters entirely at will, which enables the counter capacity to be used 
more efficiently. These tabulators, however, have no distributors, which 
detracts somewhat from their usefulness in the analysis of survey data. 

5.14 The reproducing summary punch 

The reproducing summary punch has two main functions. One is 
reproducing information from a pack of cards on to the corresponding cards 
of another pack. The second is what is called gang-punching, that is, the punching 
of information, read from the first or master card of the group, on the whole 
of a group of cards. The punch can also be used in association with a tabulator 
to punch on to new cards the results obtained in the course of a tabulation. 

In reproduction the information in any set of columns can be transferred to 
the same or any other set of columns. In gang-punching the information will 
normally be punched in the position in which it appears on the first card of the 
group. If selectors are fitted to the reproducing punch, transference to other 
columns (offset gang-punching) is possible for a number of columns not exceed- 
ing the total number of points available on the selectors. 

The reproduction of information from one pack on to another is of value 
in survey work in a number of ways. In addition to the obvious function of 
making a new pack of cards when an old one has become worn, it can be used 
to bring together on to a single card items of information referring to the 
same unit and recorded on two or more separate cards, so that the association 
between these items can be analysed. It also provides a satisfactory means of 
entering new information on to cards that have already been punched. Instead 
of punching the new information on the old cards directly, this information is 
punched on to a new pack, together with the code numbers of the units, and 
the information from the new pack is then transferred to vacant columns on 
the old pack or vice versa by means of the reproducing punch. 

When transferring information from one pack to another pack which itself 
already carries information, the two packs must of course be sorted into the 
same order. The machine, however, checks that there is correct correspondence 
between each pair of cards, and also checks that all transferred information is 
correctly punched. 

Gang-punching can be used to save hand punching when a batch of cards 
which are to be punched all carry the same code in a number of columns. It 
can also be used to transfer information from the main card on to secondary 
cards or trailers referring to the same unit. In this case the main cards act 
as master cards. In gang-punching with interspersed master cards, the master 
cards must carry an X in one of the columns in which none of the remainder 
of the cards carry an X. If such an X is not already punched it can be 
gang-punched. 

A further use in survey work is the calculation of percentages, index 
numbers, etc. A simple example will illustrate the procedure. Suppose it is 
required to express the numbers B as a percentage of the numbers A, both 
sets of numbers being punched on the cards. By suitable sorts we can assemble 
the cards in batches such that the percentage value for all the cards of each 
batch is the same. Cards for which A has the value 45, for example, and a 
value of B between 0 and 2 will have a percentage value of 0 (to the nearest 
10 per cent.), those with B between 3 and 6 will have a percentage value of 
10, etc. Master cards are therefore made out carrying the values 

     A      B      Percentage 
    45      0           0 
    45      3          10 
    45      7          20 

etc., with similar sets for other values of A, and these are added to the pack, 
which is then sorted into numerical order of the A's, and of the B's within 
A's. All the A's having a value of 45, for example, will now be together, those 
with B between 0 and 2 being preceded by the first of the above master cards, 
those with B between 3 and 6 by the second, etc. If the whole pack is then 
passed through the reproducing punch the correct values of the percentages 
will be gang-punched into the remaining cards from the master cards. 
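
The preparation of the master cards and the subsequent gang-punching can be followed in the Python sketch below. It assumes rounding to the nearest 10 per cent. with halves rounded upwards, represents a card as a tuple, and uses a flag to place each master card ahead of the detail cards with the same values of A and B; all names are illustrative.

```python
def nearest_ten_per_cent(a, b):
    """100*b/a rounded to the nearest 10, using integer arithmetic."""
    return 10 * ((20 * b + a) // (2 * a))


def master_cards(a, max_b):
    """One master card (a, b, pct) at each value of b where pct changes."""
    masters, last = [], None
    for b in range(max_b + 1):
        pct = nearest_ten_per_cent(a, b)
        if pct != last:
            masters.append((a, b, pct))
            last = pct
    return masters


def gang_punch(details, masters):
    """Sort masters and detail cards together, then copy the percentage
    of each master card on to the detail cards which follow it."""
    pack = [(a, b, 0, pct) for a, b, pct in masters]   # flag 0: master
    pack += [(a, b, 1, None) for a, b in details]      # flag 1: detail
    pack.sort(key=lambda card: (card[0], card[1], card[2]))
    punched, current = [], None
    for a, b, flag, pct in pack:
        if flag == 0:
            current = pct                # read from the master card
        else:
            punched.append((a, b, current))
    return punched


masters = master_cards(45, 45)
print(masters[:3])
punched = gang_punch([(45, 6), (45, 0), (45, 7), (45, 2)], masters)
print(punched)
```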

The disadvantage of this procedure is that a large number of master cards 
are required to cover with any high degree of accuracy fields which are at all 
extensive. Time and expense is therefore involved in the preparation of the 
cards, and they also add to the total volume of sorting required. The method 
is therefore most suitable when ratios and indices of low accuracy are required 
for large batches of data. In the analysis of the National Farm Survey it was 
used for the calculation of the percentage of the acreage of individual farms 
which was arable (12 classes), rent per acre (12 classes), etc., and for the 
combination of several items of qualitative information into a single index. 
A description of the procedure, and a method of preparing master cards by 
use of the gang punch, is given by Kempthorne (1946, B). 

5.15 The multiplying punch 

The multiplying punch is designed to read two numbers from a card, 
calculate the product and punch the result, with suitable rounding-off, in any 
field of the same card. It can also be set to read the multiplier from inter- 
spersed master cards, so that the numbers on the whole of a group of cards 
are multiplied by the same factor. Cross-footing multiplying punches will 
also add or subtract two or three numbers read from the same card and punch 
the result, or add one or two numbers to the product of two other numbers. 

The chief use of the multiplying punch in survey work is in the calculation 
of products of various kinds prior to summation. It can be used for the 
calculation of ratios if the reciprocals of the divisors are punched on master 
cards and the cards are then sorted according to their divisors. It is, however, 
relatively slow in operation. Consequently when a large amount of material 
has to be handled and high accuracy is not required in the products or ratios, 
the use of gang-punching with interspersed master cards is generally to be 
preferred. 
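
The calculation of ratios by multiplying by a reciprocal can be sketched as follows. The four-decimal scale for the reciprocal and the two function names are assumptions made for illustration; they are not a description of any particular machine.

```python
SCALE = 10_000  # reciprocals held to four decimal places (an assumption)


def reciprocal_master(divisor):
    """Scaled reciprocal of the divisor, as it might be punched on a master card."""
    # Integer form of round(SCALE / divisor), rounding halves upwards.
    return (2 * SCALE + divisor) // (2 * divisor)


def percentage(numerator, scaled_reciprocal):
    """Numerator times reciprocal, rounded to the nearest whole per cent."""
    product = numerator * scaled_reciprocal * 100
    return (2 * product + SCALE) // (2 * SCALE)


recip_45 = reciprocal_master(45)  # one master card serves every card with divisor 45
print(recip_45, [percentage(b, recip_45) for b in (5, 7, 45)])
```

The number of figures carried in the reciprocal limits the accuracy of the ratio, so the scale should be chosen with the required accuracy in mind.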

5.16 The collator 

The collator will take two packs previously sorted in numerical order on 
up to 16 columns and will combine them into a single pack, so arranged that 
all cards of pack B which carry a given code-number follow immediately on 
the cards of pack A carrying the same code-number. It will also select matching 
cards from pack A and pack B, rejecting cards of pack A with code-numbers 
which do not occur in pack B and vice versa. Furthermore it can be used to 
select cards from one pack which correspond in code designation to the cards 
contained in a second pack. This last property is sometimes of value in survey 
work, since it can be used to pick out trailers associated with a given set of main 
cards. If it is so used, however, the precaution should be taken of matching 
up the rest of the trailers with the remaining main cards so as to provide a 
check that no trailers have been erroneously excluded. 
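
The matching operation, together with the precaution of examining the unmatched cards, can be sketched in Python. The function name collate and the representation of a card as a (code, description) pair are hypothetical, and both packs are assumed to be already sorted on the code number.

```python
from itertools import groupby


def collate(pack_a, pack_b):
    """Merge two packs sorted on code number.

    Returns (merged, unmatched_a, unmatched_b). In the merged pack the cards
    of pack B follow the cards of pack A carrying the same code number.
    """
    def by_code(pack):
        return [(code, list(cards)) for code, cards in groupby(pack, key=lambda c: c[0])]

    groups_a, groups_b = by_code(pack_a), by_code(pack_b)
    merged, unmatched_a, unmatched_b = [], [], []
    i = j = 0
    while i < len(groups_a) and j < len(groups_b):
        code_a, cards_a = groups_a[i]
        code_b, cards_b = groups_b[j]
        if code_a == code_b:
            merged += cards_a + cards_b
            i += 1
            j += 1
        elif code_a < code_b:
            unmatched_a += cards_a   # code absent from pack B
            i += 1
        else:
            unmatched_b += cards_b   # code absent from pack A
            j += 1
    for _, cards in groups_a[i:]:
        unmatched_a += cards
    for _, cards in groups_b[j:]:
        unmatched_b += cards
    return merged, unmatched_a, unmatched_b


mains = [(101, "main"), (102, "main"), (104, "main")]
trailers = [(101, "trailer"), (101, "trailer"), (103, "trailer"), (104, "trailer")]
merged, spare_mains, spare_trailers = collate(mains, trailers)
print(merged, spare_mains, spare_trailers)
```

A trailer left in the unmatched list, such as the one coded 103 here, is exactly the kind of card the check described above is intended to bring to light.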

5.17 Systems of coding for punched cards 

When punched cards are used all non-numerical information will require 
to be coded in some quasi-numerical form. Each column of the card can 
represent a classification of up to twelve classes, which are denoted by X, Y, 
0, 1, 2, ... 9. 

Numerical quantities do not in general require to be coded, but may 
require rounding-off in order to economize card space and reduce the counter 
capacity needed in the subsequent tabulations. Rounding-off, however, is a 
tiresome operation and reduction of a number by a single digit should only 
be undertaken if really necessary. Often rounding-off can be avoided by 
issuing suitable instructions to the field investigators regarding the number of 
figures required in the results. 

In certain cases it may pay to code a numerical quantity by grouping. This 
is particularly advantageous when the quantity is primarily required as a basis 
of classification and does not need to be summed. If summation is needed, 
large values must not be too coarsely grouped, and there must be no " open " 
group : it is not sufficient for all the high values to be included in an " over " 
class. This often necessitates an additional column or over-punching in the 
X and Y positions. If there is to be summation the grouping must also be 
chosen so as to avoid the bias which can arise through the frequent occurrence 
of particular values (see Example 7.2.b), though such bias is relatively 
unimportant when only comparative results are required. 

The construction of a code for complicated items of information, e.g. 
questions with a large number of possible alternative answers, is often a difficult 
task, since the conflicting aims of recording the information adequately, 
simplifying the actual task of coding, and keeping the code compact have to be 
reconciled. In many cases the most suitable form of coding for a given item 
of information can only be devised by examination of a sample of the 
questionnaires or returns. 

Compactness of coding is important not only in saving card space, but also 
in simplifying the tasks of sorting and grouping the cards into classes. With 
a two-column code, other than the type described in the next paragraph, both 
sorting and counting are more complicated than in a one-column code. 
Furthermore, classes which have been separately coded cannot be grouped for 
purposes of tabulation unless a group code is gang-punched or control is 
omitted from the code columns, since the control will automatically break at 
each change of code. 

For this reason it often pays to code an item of information in two parts, 
main and subsidiary. Thus in an agricultural survey of England and Wales 
the place location will be given by one of 61 counties or part counties. 
Instead of numbering the counties consecutively the provinces can be numbered, 
and the counties numbered consecutively within provinces. This will facilitate 
sorting and tabulation of the material by provinces, and the code will still only 
require two columns. 

Provision should always be made in a coding scheme for recording lack 
of information on items for which this contingency is likely to arise. Leaving 
the column in question blank is not satisfactory, as every occupied column 
should be punched on all cards. For the same reason numbers of under 100 
in a three-column field, for example, should commence with 0, not a 
blank. 

If there are a large number of questions to which only two or three 
alternative answers are possible the columns required on the punched card 
can be reduced by combining the answers to two or more questions in a single 
code. Thus two questions, each with the alternative answers yes, no, don't 
know, can be coded in one column by using the 9 combinations of the answers. 
Such coding, however, is decidedly more troublesome and liable to error, 
and is also less convenient for subsequent analysis. In large-scale surveys, 
therefore, it is often better to keep such questions separate even if this means 
using an additional card. 
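
A combined code of this kind might be constructed as follows. The numbering of the nine combinations from 1 to 9 is an assumption chosen for illustration; any fixed one-to-one assignment would serve.

```python
ANSWERS = ("yes", "no", "don't know")


def combine(first, second):
    """Code the answers to two three-way questions as one digit from 1 to 9."""
    return 3 * ANSWERS.index(first) + ANSWERS.index(second) + 1


def split(code):
    """Recover the two separate answers from a combined code."""
    i, j = divmod(code - 1, 3)
    return ANSWERS[i], ANSWERS[j]


print(combine("no", "don't know"), split(6))
```

The need to apply split before either question can be tabulated separately is one form of the inconvenience for subsequent analysis mentioned above.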

In questions in which the answers are not mutually exclusive, multiple 
answers are usually recorded by multiple punching, i.e. punching in a single 
column the holes corresponding to all the answers given. 

Multiple punching can also be used to economize card space in other ways. 
A two-hole code, for example, is used for alphabetical coding. Unimportant 
classifications, if they contain sufficiently few classes, can be punched on different 
parts of the same column. This, however, should only be done if it is reasonably 
certain that only counts of these classifications will be required, and that they 
will not have to be used for the control of tabulations. 

In cases in which the majority of numbers in a field are below, say, 1000, 
but there are a few numbers which are between 1000 and 3000, numbers 
between 1000 and 1999 can be denoted by overpunching X in the first column 
of the field and numbers between 2000 and 2999 by overpunching Y. Certain 
tabulators are fitted with a device (the 29 feature) which enables numbers 
so punched to be added directly in the course of the tabulation. If such a 
device is not available the X's and Y's will have to be separately counted and 
adjustments made. 
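
The adjustment required when the X's and Y's are counted separately can be illustrated by a short sketch. The representation of a card as a pair (digits, overpunch) and the function names are hypothetical.

```python
OVERPUNCH_VALUE = {None: 0, "X": 1000, "Y": 2000}


def field_value(digits, overpunch=None):
    """True value of a three-column field carrying an optional X or Y overpunch."""
    return OVERPUNCH_VALUE[overpunch] + int(digits)


def adjusted_total(cards):
    """Total obtained by adding the plain digits and then adjusting for the
    separately counted X's and Y's."""
    plain_total = sum(int(digits) for digits, _ in cards)
    x_count = sum(1 for _, over in cards if over == "X")
    y_count = sum(1 for _, over in cards if over == "Y")
    return plain_total + 1000 * x_count + 2000 * y_count


cards = [("571", "Y"), ("250", None), ("034", "X")]
print(adjusted_total(cards), sum(field_value(d, o) for d, o in cards))
```

Both routes give the same total, the second being what a tabulator fitted with the special device would produce directly.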

Multiple punching should not be used excessively. It slows up the 
punching, is difficult to verify, and introduces complications into the sorting 
and tabulations. If the data cannot conveniently be coded in some simple 
form that will go on a single card with little multiple punching, it is usually 
better to use an additional card, recombining the data as required by means 
of a reproducing punch. 

A way of avoiding multiple punching when dealing with occasional numbers 
which exceed the allotted capacity of the field concerned is the use of trailers. 
Thus a number 2571 can be recorded in a three-column field by the use of two 
trailers, the numbers 999, 999 and 573 being punched. Apart from these 
numbers only the code number need be hand-punched on the trailers, the 
remainder of the information being gang-punched from the main card. 
Trailers must be distinguished from main cards. If this is done by punching 
0 and 1 respectively on some column the 1 can be used for counts; this, 
however, requires an additional column. If X and Y are used in some occupied 
column an additional distributor will be required for counts. Alternatively 
the trailers can be removed when counting. 
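
The splitting of a number between a main card and its trailers follows a simple rule, sketched below. The function name is hypothetical; the first element of the result is taken to be the value punched on the main card and the remaining elements the values punched on the trailers.

```python
def split_for_trailers(value, columns=3):
    """Split a number into parts, each of which fits a field of the given width."""
    capacity = 10 ** columns - 1   # largest number the field can hold, e.g. 999
    parts = []
    while value > capacity:
        parts.append(capacity)
        value -= capacity
    parts.append(value)
    return parts


parts = split_for_trailers(2571)
print(parts, "trailers needed:", len(parts) - 1)
```

Since the parts add up to the original number, totals taken over main cards and trailers together are correct without further adjustment.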

Simple qualitative information can often be pre-coded on the questionnaire 
form. Thus a question to which the only alternative answers are yes, no, 
don't know, can have these answers printed on the form in conjunction with 
the numbers 1, 2, 3. The investigator is then instructed to ring the appropriate 
code number. If it is necessary to make provision for possible non-standard 
answers partial pre-coding can be used, a line being left for alternative answers 
which are subsequently coded in the office. 

Pre-coding has the advantage that the amount of office work is considerably 
reduced, since the forms can be sent for punching after scrutiny without further 
work. It must, however, be confined to questions to which the alternative 
answers can be printed on the form. It is unsuitable in the case of questions 
to which complicated and involved answers are likely to be received. If 
pre-coding is used in such cases there is a danger that the recorded answers will 
be excessively stereotyped. 

5.18 Arrangement of information on punched cards 

No serious problems of card arrangement arise when the sampling units 
are the natural units of the population and the whole of the information on a 
unit can be coded on one card. The order in which the items are arranged 
on the card is immaterial in the Hollerith system. Consequently the order 
which is most convenient for punching may be adopted. Blank columns 
between columns which are punched should be avoided by arranging all the 
blank columns at one end of the card. 

It is generally advisable to leave a few blank columns in order to accommodate 
gang-punching of such items as index numbers and grouped classifications 
required in the course of the analysis. The number of blank columns likely 
to be required depends very greatly on the type of work. They may not be 
necessary at all in simple censuses for which the exact form of analysis can 
be prescribed at the outset. If in the course of the analysis it is found that 
additional columns are necessary, these can be made available by reproducing 
the cards, with the omission of items of information which are no longer 
required (or are not required in association with the additional columns). 

If more than one card is needed to accommodate all the relevant information 
on a unit, it is necessary that items of information that require to be associated 
in the analysis should ultimately appear on the same card. Apart from the 
code number each item of information need only be punched on one card, 
the requisite items being transferred to other cards of the set by the reproducing 
punch. In certain cases entirely new cards may require to be constituted in 
this way, but in others sufficient blank columns may be left on the cards 
originally punched to accommodate the additional items. Convenience of 
punching should still be one of the prime considerations in the arrangement 
of the original cards. It is generally better to make an additional set of cards 
after punching than to separate information which falls in a natural punching 
sequence. 

If the units which will form the basis of the analysis are not the sampling 
units, or if there is a hierarchy of units, the card arrangement presents more 
difficult problems. This situation is of fairly frequent occurrence. As an 
example we may consider suitable card arrangements for surveys of human 
populations in which the household is the sampling unit, and in which both 
households and individuals require to be treated as units in different parts of 
the analysis. 

If the survey is a simple one, in which the whole of the information relating 
to the household can be accommodated in say 20 columns, and the whole of 
the information relating to each individual in say 10 columns, it will be possible 
to accommodate all the information relating to a household of up to five 
individuals on one card, leaving 10 columns blank for subsequent use. With 
this arrangement of the card, households of 6 to 10 individuals will require 
one trailer, 11 to 15 two trailers, etc. 
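
The column budget and the number of trailers in this arrangement can be checked by a short calculation. The sketch assumes an 80-column card, which is what the 10 blank columns mentioned above imply; the function name is hypothetical.

```python
CARD_COLUMNS = 80        # assumed card width
HOUSEHOLD_COLUMNS = 20
INDIVIDUAL_COLUMNS = 10
INDIVIDUALS_PER_CARD = 5

blank_columns = CARD_COLUMNS - HOUSEHOLD_COLUMNS - INDIVIDUALS_PER_CARD * INDIVIDUAL_COLUMNS


def trailers_needed(individuals, per_card=INDIVIDUALS_PER_CARD):
    """Number of trailer cards required by a household of the given size."""
    cards = -(-individuals // per_card)   # ceiling division
    return max(0, cards - 1)


print(blank_columns, [trailers_needed(n) for n in (1, 5, 6, 10, 11, 15)])
```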

This arrangement has the disadvantage that tabulations relating to individuals 
require five separate passages of the cards through the tabulator, with sorts 
between each passage, since individuals may appear in every one of five divisions 
of the card. This does not necessitate the passage of a larger total number 
of cards through the tabulator than if each individual were represented on a 
separate card, but it does result in the summaries being produced in five 
separate parts. These will then have to be combined by hand, or by punching 
and tabulating summary cards. The total time of tabulation will also be 
somewhat increased owing to the increased number of printing and clearing 
cycles. 

The alternative is to have a separate card for each individual. This will 
in any case be necessary if the amount of information relating to individuals 
is such that more than 10-15 columns are required per individual. It is usually 
necessary to have at least some of the information relating to the household 
reproduced on each individual card. If the household information requires 
say 40 columns and the individual information 30 columns, the household and 
the first individual in it can be punched on the first card. For the subsequent 
individuals only the code number of the household need be punched, the 
remainder of the household information being gang-punched subsequently. 
For convenience of punching, the code number should be so placed on the 
card that it is contiguous to the individual information. Each individual 
should also be allotted a serial number within the household in some orderly 
sequence, e.g. head of the household, wife, children by age, and other members 
by age. If still more space is required for the household information a separate 
card or cards will have to be given over to the household, with a selection of 
this information gang-punched on the cards for individuals. 

The use of a separate card for each individual has one serious drawback. 
Although the whole of the information relating to a household and to all the 
individuals in it is recorded on the cards, it is impossible to classify households 
according to the collective characteristics of the individuals contained in them. 
We can of course pick out households containing one or more individuals 
having a given characteristic, e.g. we can select all households containing babies 
of under a year old by selecting the cards representing such babies. But we 
cannot, for example, classify the households according to number and age of 
children, unless this information has already been coded and recorded in 
summary form on the household part of the card. 

In order to enable households having given collective characteristics to be 
picked out on the sorter, it is necessary for the whole of the relevant information 
concerning all the individuals in the household to be recorded on a single 
card. If, therefore, individuals in the same household are spread over more 
than one card we must construct a new set of household cards containing the 
relevant particulars of all individuals in the household. Some upper limit 
must be imposed on size of household, households of above this size being 
dealt with by hand where necessary. Thus with a 5-column field for household 
code number, and the reservation of 15 columns for subsequent recording of 
new classifications, it is possible to allot 5 columns to each of 12 individuals. 
The construction of such cards can be effected by reproducing on to the first 
set of 5 columns of the new cards the relevant columns of all individuals 
numbered 1, together with the household code numbers, followed by all 
individuals numbered 2 and so on. When the new household cards have 
been constructed they can be classified and machine coded according 
to their characteristics by sorting and gang-punching, using master cards. 
Finally the new codes can be transferred to new household cards, together with 
the required household information. 
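
The construction of the new household cards amounts to placing the particulars of individual number 1 in the first set of columns, those of individual number 2 in the second, and so on. A sketch in Python is given below. The card layout (household code, serial number, a 5-character field) and the function name are hypothetical, and households exceeding the limit are merely listed for treatment by hand.

```python
MAX_INDIVIDUALS = 12
# Column budget from the text: 5 (code) + 15 (reserved) + 12 * 5 (individuals)
assert 5 + 15 + MAX_INDIVIDUALS * 5 == 80


def household_cards(individual_cards, max_individuals=MAX_INDIVIDUALS):
    """Assemble one record per household from cards of (code, serial, fields).

    Returns (households, oversize): households maps each code to a list of
    max_individuals slots, slot k - 1 holding the fields of individual k.
    """
    households, oversize = {}, set()
    for code, serial, fields in individual_cards:
        if serial > max_individuals:
            oversize.add(code)      # too large for one card: deal with by hand
            continue
        slots = households.setdefault(code, [None] * max_individuals)
        slots[serial - 1] = fields
    for code in oversize:
        households.pop(code, None)
    return households, sorted(oversize)


cards = [(1, 1, "M4510"), (1, 2, "F4310"), (2, 1, "F3000"),
         (3, 13, "M0200"), (3, 1, "M5000")]
households, by_hand = household_cards(cards)
print(households[1][:3], by_hand)
```

Once all the individuals of a household lie on one record, a collective characteristic such as the number of occupied slots can be derived and coded from that record alone.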

This process is not necessarily simpler than the alternative of coding by 
hand from the original forms. In a small survey hand-coding will probably 
be more economical, particularly if the collective characteristics that require 
to be coded are known at the outset. Each case must be judged on its merits 
as it arises. 

A further alternative, if the data on the original forms is not easily accessible, 
is to use the tabulator to list the relevant particulars of all individuals family 
by family, the coding being carried out by hand from this list. 

5.19 General remarks on the planning of the computations 

The preceding sections indicate the potentialities and limitations of the 
various methods of handling survey material. The methods of computation 
which will be most appropriate in given cases depend not only on the type 
of material, on the scale of the survey, and on the analysis required, but also 
on the equipment and personnel available. When punched card equipment 
is not readily accessible, for example, it may be better to use alternative methods, 
even though punched card equipment would otherwise be appropriate. 

The extent to which it is advisable to carry out preliminary computations 
of index numbers, etc., on punched card equipment is also a matter which 
depends on the relative availability and cost of ordinary computing labour 
and punched card equipment. By suitable preliminary computations it is 
often possible materially to reduce the amount of work that has to be carried 
out on punched card machines. Thus, if only totals or means of multiple 
measurements taken on the same unit are required in the analysis it will clearly 
be better to obtain these before the cards are punched, punching only the 
results on the cards. On the other hand, such items as age at marriage can 
frequently be conveniently obtained from date of marriage and date of birth 
by gang-punching with interspersed master-cards. The question of the extent 
to which computations of this kind can be economically mechanized and carried 
out on punched card machines in any given circumstances is a very technical 
one, and the advice of an expert should always be sought before final decisions 
are taken. 

In certain types of survey a good deal of preliminary computation is required 
on the forms before they can be abstracted and coded. Thus in the Survey 
of Fertilizer Practice (Section 4.23 and Example 6.19), which has been analysed 
on Cope-Chat cards, the applications of fertilizer are usually stated by the 
farmers in the form of so many hundredweights per acre of a given compound. 
From the dressing per acre and the chemical composition of the fertilizer 
it is necessary to work out the dressings per acre and the total dressings of the 
different plant nutrients, nitrogen, phosphate and potash. In the earlier surveys 
the weighting factors, which depended on the number of the fields and on 
the overall sampling fractions, were also introduced at this stage, both the 
acreages and the total dressings on these acreages being raised by these factors. 
In the later surveys a method of grouping the data according to the size of the 
weighting factor was adopted. 

There is always a danger of over-mechanization in the analysis of survey 
material. The use of punched cards in particular can easily lead to a stereotyped 
and uncritical form of analysis. If surveys of the investigational type are 
analysed by punched card methods provision should always be made for further 
tabulations, etc., the need for which may become apparent when the results 
of the first tabulations have been examined in detail. 

The degree to which an analysis can be planned at the outset depends 
very much on the type of survey. In a simple census type of survey the 
categories of information required may be determined at the outset by 
administrative requirements, and in this case the whole of the analysis can 
often be planned in advance. In surveys of the investigational type, however, 
it is only when the results have been examined and subjected to preliminary 
analysis that the most appropriate form for the final analysis will become 
apparent. 

Even in surveys of the census type the information collected often forms 
a suitable basis for more detailed and critical statistical investigations. The 
possibility of such investigations should be considered at the outset when 
planning the coding of the material, so that the information will be available 
in accessible form if required. Items of information should not in general 
be omitted from the coding merely on the ground that they are not required 
for the primary analysis. The general aim should be to summarize the whole 
of the relevant information in coded form, so that should new needs arise or 
should the primary analysis indicate that further analysis is likely to be of 
value, the work can be undertaken without re-coding or the preparation of 
new cards. 

5.20 Control of numerical accuracy in the analysis 

The attainment of a high standard of accuracy in computational work is 
extremely difficult, and demands most careful organization of the checking 
procedures and scrupulous attention to detail at all stages. Moreover a reliable 
checking system can only be devised by careful study of the types of error 
which are likely to remain undetected. 

Numerical work may be checked by repetition, by cross checks, or by 
using different methods of computation to arrive at the same results. 
Occasionally a comprehensive check which completely checks a large piece of 
computation is available, e.g. the solution of a set of linear equations must 
satisfy these equations. Reliable comprehensive checks and checks based on 
different methods of computation, however, are unfortunately rarely available 
in census and survey work. 

Checking by repetition may consist of working over the same computation 
a second time, or carrying out the computations in duplicate. The original 
and check computations may be carried out by the same or by different 
computers. 

In a computation which is checked by working over the figures a second 
time the main causes of error may be classified as follows : 

(1) Failure to check a value which is in error. (This includes partial failure, 
e.g. recalculation of the numerical value without checking the position of the 
decimal point.) 

(2) An identical error in both the original and check computation. 

(3) Different errors in the original and check computation which produce 
the same error in the next written or examined figure. 

(4) Failure to notice disagreement between the original and check 
computations when the original is in error. 

(5) Alteration of the original to agree with the check when the check is 
in error. 

(6) Failure to carry forward correctly corrections necessitated by a 
detected error. 

(7) Incorrect procedure in the original which is followed by the checker. 

The danger of identical errors is obviously considerably greater when the 
same computer carries out both computations. Indeed at first sight it might 
appear that if a reasonably high standard of computing is attained, the chance 
of two computers making an identical error would be somewhat remote. There 
are, however, certain errors which are particularly common, such as, for 
example, the mis-reading of a badly written figure, incorrect location of the 
decimal point, the reversal of a pair of figures, e.g. 49,876 for 48,976, and 
duplication of the wrong figure, e.g. 74,496 for 74,996. 

Duplicate computations, if properly carried out, i.e. not compared too 
frequently and corrected independently if an error is detected, will very greatly 
reduce the chances of most of the above types of error. Indeed a properly 
conducted set of duplicate computations done by different computers of 
reasonably high standard may be regarded as sufficiently accurate for almost 
all census and survey calculations. The checking of a single set of computations, 
however, even by different computers, cannot be expected to eliminate all 
errors, and if no other checks exist must be looked on as an unsatisfactory 
procedure for the more important computations. 

Fortunately in many types of computation we do not have to rely solely 
on checks by repetition. A great deal of census and survey analysis is subject 
to cross checks of various kinds. Thus the counts relating to each of a number 
of classifications should all add up to the same total count. The same is true 
of totals of quantitative measurements. Indeed, where cross checks of this 
kind are not available it is often best to check a set of totals by calculating 
the grand total rather than by checking every individual total. The use of a 
grand total requires only a single comparison, which can consequently be made 
with some care. If all the individual totals have to be checked, there is serious 
danger that some discrepancy may be missed. 

The use of cross checks in place of detail checks has a further advantage 
which is not so immediately apparent. If such a check fails to agree a good 
deal of re-computation is usually required to locate the error. This 
automatically tends to raise the standard of computation. 

It is important to recognize which types of error will be detected by cross 
checks and which will not be so detected. If a total check is relied on, for 
example, all entries in a table are checked, but their locations in the table are 
not checked. Thus it is possible for quantities to be entered under the wrong 
headings. Such errors can be minimized by observing a standard order in all 
tables and always entering the values in the standard order. 
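
Both the strength and the limitation of a check by grand total can be seen in a small sketch. The figures are invented purely for illustration, and the grand total is assumed to have been obtained independently of the individual entries.

```python
def grand_total_check(entries, independent_grand_total):
    """True if the entries add up to the independently obtained grand total."""
    return sum(entries) == independent_grand_total


GRAND_TOTAL = 1000                 # hypothetical independent total
correct = [412, 388, 200]
miscopied = [412, 338, 200]        # second entry written down wrongly
misplaced = [388, 412, 200]        # first two entries under the wrong headings

print(grand_total_check(correct, GRAND_TOTAL),
      grand_total_check(miscopied, GRAND_TOTAL),
      grand_total_check(misplaced, GRAND_TOTAL))
```

The miscopied entry is detected, but the interchange of two entries is not, since the total is unaffected by the order of its parts.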

One of the main functions of any checking system is to preserve a high 
standard in the computations. Very rigorous standards of work should be 
imposed. If more than a very few errors are found to exist a complete 
re-computation should be made. No erasures or fair copies must be permitted, 
and thorough inspection of all alterations must be made to see that errors 
are properly rectified. In large-scale routine work a record should be kept of 
the errors made by the different members of the staff. The supervisor must 
be ready at all times to resolve difficulties of procedure, otherwise the computers 
will undoubtedly attempt to resolve such difficulties amongst themselves, 
possibly incorrectly. A high standard of neatness must be insisted on. All 
figures must be legible and unambiguous not only to the writer but to others. 
This is particularly important in coding. Confusion between 6 and 0, and 
between X and Y, gives rise to many errors. 

The coding and punching of the data of a large-scale census or survey 
presents its own organizational and checking problems. Even if a good deal 
of the information has to be coded it is advisable to record the coding on the 
questionnaire forms if possible, since transcription of pre-coded and numerical 
information is thereby avoided. If this is not possible a coding sheet may 
be used. This consists of an auxiliary printed form on which the information 
is entered in code, the form being so arranged that it is both convenient for 
the punch operators and for use in minor hand-analyses, if these are found to 
be required. 

In certain cases the field investigators can be asked to code their own 
material at the end of each day's work. In general, however, this is not 
likely to be satisfactory, as it is difficult to preserve consistent standards of 
coding. 

If the coding is at all difficult it is best to code one or a small group of 
items of information on a batch of forms at one time, rather than to code the 
whole of each form in turn. Whatever the detailed procedure adopted, however, 
it is essential for the supervisors to carry out adequate checks to ensure that 
correct and consistent standards are maintained. 

In addition to routine checks it is often possible to impose checks of various 
kinds for gross errors and inconsistencies of coding. A special type of sorter, 
which picks out cards carrying a given code in a number of columns, can be 
used for this purpose. 

Furthermore it is sometimes advisable to pick out extreme values and list 
the relevant particulars, so that it can be ascertained whether these values look 
reasonable.* 

In the above remarks we have been primarily concerned with the human 
element. It must not be assumed, however, that punched card equipment 
operates without error. The mechanism may fail in various ways, and such 
failure may be momentary only. 

In order to devise adequate checks which at the same time are not unduly 
laborious, a full understanding of the mechanism is necessary. Thus, for 
example, if the control is operating on the columns on which the cards are 
sorted any mis-sort will be detected from the printed record, since the control 
will then break. Nevertheless sorting should always be given a preliminary 
check, either by passing a needle through the corresponding holes on each 
batch of cards, or, if the batches are small, by visual inspection, holding the 
batch up to a light. 

The correctness of the totals provides adequate checks for much of the 
work on the tabulator, but it must be remembered that such totals are not 
fully checked by a single run of the cards through the machine, even if 
sub-totals are accumulated on one counter and the grand total on another, since 
there may be a faulty reading of a card. Equally, if a series of progressive 
totals are taken on a counter, the fact that the grand total is correct does not 
mean that all the progressive totals are correct, since the printing mechanism 
may have printed one of the numbers incorrectly. 

The above remarks should not be taken to imply that any large number of 
errors are to be expected with punched-card equipment, but only that it must 
not be assumed that every sort and every printed figure is necessarily correct. 

Whatever the methods of calculation, the final results of every analysis 
should be carefully scrutinized for apparent inconsistencies and irregularities, 
and any anomalous values should be thoroughly investigated. 

Since the numerical material handled is itself in general subject to errors 
of various kinds, absolute numerical accuracy need not necessarily be attained 
at the early stages of the calculations. For this reason it is sometimes practicable 
to impose sample checks on such operations as punching and coding in 
large-scale work. If such checks are relied on, however, it is essential that steps 
are taken to prevent gross errors (Deming et al., 1942, B). 

Finally it should be emphasized that different types of work and different 
stages of the calculations demand very different standards of accuracy. A 
single misclassified card in a count, for example, will usually produce an 
entirely trivial error in the results. But the mispunching of a number 
representing a quantitative character, e.g. 610 for 010, may produce a serious 
error in the resultant mean or total of the class in which the unit falls. Errors 
in the final stages of the calculations are always likely to be more serious than 
those in the earlier stages. For this reason, in important work duplicate 
computations should be insisted on for the final stages, and these duplicates 
should themselves be used to check the typed or printed tables of the report. 

* See also Section 10.13. 

5.21 The use of sampling in the statistical analysis 

In certain cases it is possible to attain the necessary accuracy and at the 
same time to reduce the volume of numerical and machine work by analysing 
a sample of the available data. Such sampling may be applied to the data 
from a complete or sample census or survey. 

At first sight the use of sampling in this manner appears illogical, since 
it might be argued that if the collection of the information on the whole of the 
population or on a large sample was justified, its inclusion in the analysis is 
also justified. This, however, is not always the case, since a complete census 
or a large sample may have been taken in order to furnish information on 
individual units or on small groups of units, while the further analysis may 
be required to elicit information which does not require to be broken down in 
detail. Moreover it does sometimes happen that excessively large samples are 
taken which can well be reduced before analysis. 

As already mentioned, the analysis of a sample of the returns is also of use 
in providing preliminary results for a complete or sample census, even though 
the whole of the material will ultimately require analysis. 

Furthermore, when a large sample has been taken for administrative 
purposes, supplementary analyses of the investigational type can often best 
be undertaken on a sub-sample of the original sample. The reduction in the 
total volume of material to be handled is of particular value in such analyses, 
since they often require the application of relatively complicated statistical 
processes. Special points which emerge and on which a higher accuracy is 
desired can be re-tabulated subsequently by using the whole or a larger 
sub-sample of the material. 

The actual technique of obtaining a sample suitable for analysis is usually 
relatively simple. For many purposes a systematic sample of every qth return 
is all that is required. In some cases, however, the use of a variable sampling 
fraction is advisable. This is particularly the case in the analysis of census 
returns referring to economic institutions, factories, farms, etc., since these 
are usually of very variable size. 
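
The selection of every qth return can be sketched in a few lines of Python. This is an illustration only: the function name `systematic_sample` and the use of a random starting point are assumptions, not a procedure prescribed in the text.

```python
import random

def systematic_sample(returns, q, start=None):
    """Return every q-th item of `returns`, beginning at index `start`.

    If `start` is not given, a random start in 0..q-1 is drawn, so that
    every return has the same chance of selection.
    """
    if q < 1:
        raise ValueError("q must be a positive integer")
    if start is None:
        start = random.randrange(q)
    return returns[start::q]

# Example: a 1-in-10 sample from 95 numbered returns, starting at the 4th.
sample = systematic_sample(list(range(1, 96)), q=10, start=3)
print(sample)
```

With a variable sampling fraction the same selection would be made separately within each size-group, using a different value of q for each group.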

An example of an analysis of this type is provided by the National Farm 
Survey of England and Wales (Ministry of Agriculture, 1944, G), which covered 
all holdings in England and Wales of over 5 acres. Sampling was not used in 
the survey because records were required for each individual farm, both for 
administrative purposes and for detailed studies of small areas. A map of the 
boundaries of each farm, for example, was one item of information which 
was collected. 

For the purpose of obtaining a general summary of the results by counties, 
types of farming, etc., the analysis of the whole of the material was unnecessary. 
The holdings were therefore divided into size-groups and a systematic sample 
stratified for counties and size-groups was taken, using a variable sampling 
fraction for size-groups. The sampling fractions, and numbers of holdings 
in the population and in the sample, are shown in Table 5.21. 

TABLE 5.21 ANALYSIS OF THE NATIONAL FARM SURVEY: 
CONSTITUTION OF SAMPLE 

Size-group     Average size     No. of       Sampling fraction     No. of holdings 
(acres)        (acres)          holdings     (per cent.)           in sample 

5-25                  12        101,450              5                  5,072 
25-100                55        111,360             10                 11,136 
100-300              165         65,210             25                 16,302 
300-700              413         11,150             50                  5,575 
Over 700           1,035          1,430            100                  1,430 

Total                           290,600          (13.6)                39,515 

Had a uniform sampling fraction been used in place of a variable sampling 
fraction, a sample over twice as large would have been required to give results 
of the same accuracy on such items as the percentage of land under different 
systems of tenure. By the use of a variable sampling fraction results of ample 
accuracy were obtained from an analysis covering only one-seventh of all the 
holdings. This not only considerably reduced the amount of coding and 
machine work, but also enabled work to proceed as soon as the information 
for the sample farms had been assembled and abstracted. In consequence 
it was possible to make the results of the analysis available a year or two sooner 
than would have been the case had the whole of the material had to be abstracted 
before analysis. * 
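
The arithmetic behind the constitution of the sample can be checked with a short sketch. The holdings and sampling fractions are those of Table 5.21; the variable names, the dropping of fractional holdings, and the "raising factor" calculation are assumptions made for the illustration.

```python
# Size-group: (number of holdings, sampling fraction in per cent), Table 5.21.
strata = {
    "5-25": (101_450, 5),
    "25-100": (111_360, 10),
    "100-300": (65_210, 25),
    "300-700": (11_150, 50),
    "Over 700": (1_430, 100),
}

def sample_size(holdings, per_cent):
    # Whole holdings only; a fractional holding is dropped (an assumption
    # that happens to reproduce the tabulated sample numbers).
    return holdings * per_cent // 100

sizes = {group: sample_size(n, f) for group, (n, f) in strata.items()}
total_holdings = sum(n for n, _ in strata.values())
total_sample = sum(sizes.values())
overall_per_cent = 100 * total_sample / total_holdings

# Raising factor: the number of holdings in the population represented
# by each sampled holding of the size-group.
raising_factors = {group: 100 / f for group, (_, f) in strata.items()}

print(sizes)
print(total_sample, round(overall_per_cent, 1))
```

Totals for the whole population would then be estimated by multiplying each size-group's sample total by its raising factor before adding over size-groups.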

5.22 Adjustment of the results to compensate for defects in the sample 

When the sampling procedure is defective in one respect or another, 
attempts are sometimes made to adjust the results in order to compensate for 
the defects. Thus it may happen that owing to defects in the selection of the 
sample or in the collection of the information, different classes of the population 
are found to be represented in incorrect proportions in the final sample. In 
such cases it is possible to adjust the results by weighting the different classes 
in such a manner as to compensate for the errors in the proportions. 

This procedure must be clearly distinguished from the procedure of 
stratification after selection mentioned in Section 3.3. The validity of the 
latter procedure depends on the fact that the sample as a whole is random 
and therefore the selection from within strata is also random. If the 
proportions in the different classes are different because of defects in the 
sampling procedure, however, it is most unlikely that the selection from within 
these classifications will be fully random. Any adjustment of the type envisaged, 
therefore, although it may somewhat improve matters, must not be expected 
to eliminate by any means the whole of the defects. 

* A further example is provided by the 1 per cent. sample of the 1951 Census of the 
United Kingdom, described in Section 10.16 (1952, C'). 

Stratification after selection is a special case of the use of supplementary 
information of all kinds. Such adjustments, whether planned at the outset 
or decided on subsequently after examination of the data, are quite justified. 
The essential difference between these adjustments and adjustments 
of the same type made in order to compensate for defects in the sampling 
procedure is that in the former the selection is random, except for permissible 
restrictions, whereas in the latter it may be biased in various ways. 

In general, if the sampling procedure is defective it is best to report the 
results obtained without adjustment. At the same time data should be given 
indicating, so far as is possible, the deviations of the sample from the expected 
distributions. Thus if the proportions in the different classes of a classification 
are known for the population these may be presented alongside the parallel 
classification of the sample. Similarly the sample means of quantities for 
which the population means are known may be presented for comparison. 
Occasionally an adjustment of some of the more important values derived from 
the sample may be considered worth while, but in such cases the unadjusted 
results should also be presented. 

The above remarks apply primarily to samples for which the sampling 
procedure is markedly defective. In cases in which there are slight defects, 
such as a minor degree of non-response, the application of some small 
adjustment, if this appears necessary, is more justified. If such adjustments 
are made, however, the fact should be clearly stated and their magnitude 
should be indicated. 

The simplest way of dealing with non-response is to regard the non- 
respondents as similar to the remainder of the sample, i.e. to treat the sample 
as if it were a sample on a smaller number of units. With a stratified sample, 
the non-respondents in each stratum can be treated as the equivalent of 
respondents in that stratum.* Alternatively some other appropriate classification 
can be used, as indicated above. 

If follow-up methods have been used and there has been a good response 
to the follow-up, initial non-respondents who subsequently respond can be 
treated as a sub-sample of all initial non-respondents and weighted accordingly. 
It is clear that if there is any difference between respondents and non-respondents 
the final non-respondents may be expected to be more like the initial non- 
respondents than the general population. This procedure was first, so far as 
I know, suggested by Professor D. V. Glass, and was used by him in the 
analysis of the Family Census (Section 4.10). 

In this survey those who failed to provide the enumerator with the required 
information were sent a letter further explaining the purposes of the survey 
and requesting that the form be sent direct to the Royal Commission. Of the 
230,000 initial non-respondents (i.e. 17 per cent. of the whole sample), 50,000 
responded to this appeal. This 50,000 therefore constituted a sample (though 
a non-random one) of the 230,000, and the first 12,000 of the 50,000 replies 
were combined with the remainder of the sample with a weight of 230/12. 

* See Section 10.3 for a method of carrying out the adjustments in a large survey. 

This procedure was found to give overall birth-rates which corresponded 
very closely to those already known from other sources, whereas the original 
sample gave birth-rates which were substantially too high, owing to the fact that 
the majority of the initial non-respondents were women with few or no children. 
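
The weighting of a follow-up sub-sample can be sketched as below. The weight 230/12 is that quoted for the Family Census; the function `combined_mean` and the numerical figures passed to it are hypothetical and serve only to show the arithmetic.

```python
def combined_mean(mean_resp, n_resp, mean_followup, n_followup_used,
                  n_initial_nonresp):
    """Weighted mean of original respondents and a follow-up sub-sample.

    Each follow-up reply used stands for
    n_initial_nonresp / n_followup_used initial non-respondents;
    original respondents carry weight 1.
    """
    weight = n_initial_nonresp / n_followup_used
    total = mean_resp * n_resp + mean_followup * n_followup_used * weight
    return total / (n_resp + n_followup_used * weight)

# Weight used in the Family Census: 230,000 initial non-respondents
# represented by the first 12,000 follow-up replies.
family_census_weight = 230_000 / 12_000

# Hypothetical figures, for illustration only (not Family Census results).
example = combined_mean(mean_resp=2.0, n_resp=800,
                        mean_followup=1.0, n_followup_used=20,
                        n_initial_nonresp=200)
print(round(family_census_weight, 2), example)
```

In the hypothetical case the unadjusted mean of the respondents, 2.0, is pulled down towards the follow-up mean because the 20 follow-up replies are made to represent all 200 initial non-respondents.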

5.23 Critical analysis of survey data 

At the outset a clear distinction must be made between the types of deduction 
that can be made with certainty from survey data and the types that are 
speculative. If in a nutrition survey, for example, we find that children of 
large families are worse fed than children of small families we can draw the 
definite conclusion that size of family is associated with malnutrition of the 
children, and we can give quantitative estimates of the degree of malnutrition 
actually existing amongst children of families of different sizes. We cannot, 
however, assert with certainty that size of family is the cause of this malnutrition, 
though the fact that in large families the income per head is automatically 
less if there is a fixed total income would lead us to expect an underlying causal 
relationship. 

Even in situations where a definite causal relationship is known to exist, 
deductions as to the magnitudes of the effects of given factors can never be 
made with certainty from survey data. We may, for instance, find that fields 
receiving fertilizers give higher yields per acre than fields without fertilizers. 
Yet we cannot attribute the observed differences solely to differences in fertilizers. 
The farmers using the fertilizers may be farming better land, they may be 
growing higher-yielding varieties, and they may be carrying out their farming 
operations with greater skill. 

Clearly definable extraneous factors which may influence the estimates of 
the effects of other factors can be determined in the course of the survey. 
Under certain circumstances the disturbance due to them can be eliminated 
by methods of analysis which will be outlined in this and the following section. 
But there will always be other undetermined and possibly unascertainable 
factors which cannot be taken into account. 

In order to determine with certainty the magnitude in the causal sense of 
the effect of any given factor, experiments must be undertaken. Surveys 
cannot be regarded as satisfactory substitutes for experiments. Nevertheless 
they are of value in situations in which experiments are difficult or impossible, 
though in such cases all conclusions must be tentative. They are also of value 
as a preliminary to experimental work, since they frequently indicate the 
factors that are likely to be most worth investigation. 

If, however, survey data are to be effectively used for either of these two 
purposes it is important to have means of eliminating the effects of extraneous 
factors in so far as this is possible. 

A simple example will illustrate the problem involved. Table 5.23.a 
gives the numbers of fields, totals and means of yields per acre of a sample of 
901 potato fields classified according to (a) the five regions into which the 
country was divided, and (b) the five varieties included in the survey. These 
results were obtained in the course of an investigation into the blackening of 
potatoes on cooking, the data on yields being collected from the farmers when 
the samples were taken, together with a considerable amount of information 
on fertilizers, cultural practices, etc. Approximately 180 fields were selected 
in each region. The selection within regions was not strictly random, but can 
be regarded as substantially so for the purpose of the present discussion. The 
sample was confined to the five named varieties, but was not stratified by 
varieties. 

From the results of Table 5.23.a it is apparent that the mean yield is highest 
for the Scottish region, and is also higher for the Northern region than for the 
remaining regions. There are, however, even larger varietal differences in 
yield. Consequently if the varieties were grown in different proportions in 
the different regions the regional differences are likely to be influenced by 
varietal differences. 

To examine this point it is necessary to construct the two-way classification, 
regions × varieties. This is shown in Table 5.23.b. The values of 
Table 5.23.a appear as marginal totals in this table. 

TABLE 5.23.a POTATO SURVEY: NUMBERS OF FIELDS AND TOTALS AND MEANS 
OF THE YIELDS PER ACRE (TONS) 

(a) Classified by regions 

                  No.      Total      Mean 

Scotland          174      1,482      8.52 
North             177      1,425      8.05 
E. Midlands       189      1,415      7.49 
South             182      1,324      7.27 
West              179      1,368      7.64 

ALL               901      7,014      7.78 

(b) Classified by varieties 

                  No.      Total      Mean 

Majestic          393      3,292      8.38 
King Edward       250      1,563      6.25 
Great Scot         56        461      8.23 
Arran Banner       84        766      9.12 
Kerr's Pink       118        932      7.90 

ALL               901      7,014      7.78 

TABLE 5.23.b POTATO SURVEY: TWO-WAY CLASSIFICATION OF THE DATA BY 
REGIONS AND VARIETIES 

Numbers of fields 

                  Scot.     North     E. Mid.     South     West      Total 

Majestic            37        75        104        101        76        393 
King Edward         42        14         85         66        43        250 
Great Scot          18        14          -          6        18         56 
Arran Banner         8        38          -          9        29         84 
Kerr's Pink         69        36          -          -        13        118 

TOTAL              174       177        189        182       179        901 

Totals of yields per acre (tons) 

                  Scot.     North     E. Mid.     South     West      Total 

Majestic           350       614        876        823       629      3,292 
King Edward        321        80        539        387       236      1,563 
Great Scot         166       106          -         49       140        461 
Arran Banner        73       351          -         65       277        766 
Kerr's Pink        572       274          -          -        86        932 

TOTAL            1,482     1,425      1,415      1,324     1,368      7,014 

Means of yields per acre (tons) 

                  Scot.     North     E. Mid.     South     West      All 

Majestic          9.46      8.19       8.42       8.15      8.27      8.38 
King Edward       7.65      5.71       6.34       5.87      5.49      6.25 
Great Scot        9.22      7.57         -        8.17      7.78      8.23 
Arran Banner      9.12      9.24         -        7.22      9.54      9.12 
Kerr's Pink       8.29      7.61         -          -       6.61      7.90 

ALL               8.52      8.05       7.49       7.27      7.64      7.78 

From the first part of this table (numbers of fields) it is apparent that the 
distribution of varieties is by no means the same for all regions. In particular 
very little King Edward, which yields about 2 tons per acre less than the other 
varieties, was grown in the Northern region. The relatively high yield of the 
Northern region is therefore accounted for, in part at least, by varietal 
differences. 

To make an estimate of what the differences between regions would be if 
the proportions of the different varieties were the same in all regions we can 
compare the regional means of the individual varieties. These are given in 
the last part of Table 5.23.b. Inspection shows that Scotland gives higher 
yields than every other region for all varieties except Arran Banner, of which 
there are only 8 fields in Scotland. The Northern region, on the other hand, 
does not show any consistent differences from the other English regions. 
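
The inspection described above can be reproduced by computing the sub-class means from the numbers of fields and totals of yields of Table 5.23.b. The sketch below is illustrative; the names are assumptions, unoccupied cells are marked `None`, and the King Edward total for the E. Midlands is taken as 539, the value consistent with the tabulated mean 6.34 and with the marginal totals.

```python
regions = ["Scot.", "North", "E. Mid.", "South", "West"]

# Numbers of fields and totals of yields (tons per acre), Table 5.23.b.
numbers = {
    "Majestic":     [37, 75, 104, 101, 76],
    "King Edward":  [42, 14, 85, 66, 43],
    "Great Scot":   [18, 14, None, 6, 18],
    "Arran Banner": [8, 38, None, 9, 29],
    "Kerr's Pink":  [69, 36, None, None, 13],
}
totals = {
    "Majestic":     [350, 614, 876, 823, 629],
    "King Edward":  [321, 80, 539, 387, 236],
    "Great Scot":   [166, 106, None, 49, 140],
    "Arran Banner": [73, 351, None, 65, 277],
    "Kerr's Pink":  [572, 274, None, None, 86],
}

def sub_class_means(numbers, totals):
    """Mean yield of every occupied cell; None for unoccupied cells."""
    means = {}
    for variety, counts in numbers.items():
        means[variety] = [
            None if n is None else totals[variety][j] / n
            for j, n in enumerate(counts)
        ]
    return means

means = sub_class_means(numbers, totals)

# Varieties for which Scotland (first column) has the highest mean.
scotland_highest = [
    variety for variety, row in means.items()
    if row[0] == max(m for m in row if m is not None)
]
print(scotland_highest)
```

Arran Banner is the only variety missing from the resulting list, in agreement with the inspection of the table.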

Inspection of this kind may in certain cases be all that is necessary. 
Frequently, however, quantitative estimates of the differences attributable to 
one classification when freed from the effects of a second classification are 
required. Such estimates may be obtained in various ways, depending on the 
nature of the table and which differences are of interest. 

(1) Unweighted means of sub-class means 

If all the sub-class means are of adequate accuracy the marginal unweighted 
means of these means can be taken. These unweighted means will give 
estimates of the differences attributable to either classification when the units 
of each class are equally divided between the classes of the other classification. 

This method cannot be applied to the whole of Table 5.23.b because of 
the unoccupied cells, but it can be applied to the parts of the table represented 
by all varieties in the Scottish, Northern and Western regions, or all regions 
for varieties Majestic and King Edward. The results are shown in Table 5.23.c. 
For comparative purposes each set of means has been adjusted by adding a 
constant amount so that the mean of the set is equal to the general mean. 
The similarity of the Northern region with the other English regions is 
confirmed. The yield of Kerr's Pink has also been reduced relative to the other 
varieties. This is a consequence of the high proportion of Kerr's Pink in 
Scotland. 
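
Method 1 amounts to the following arithmetic, sketched here for the three regions in which all five varieties occur. The sub-class means are those of Table 5.23.b; the function names are assumptions. Because the sketch carries unrounded intermediate values, its results may differ from the tabulated ones by a unit in the last place.

```python
# Sub-class means (tons per acre) for Majestic, King Edward, Great Scot,
# Arran Banner and Kerr's Pink, from the last part of Table 5.23.b.
sub_means = {
    "Scotland": [9.46, 7.65, 9.22, 9.12, 8.29],
    "North":    [8.19, 5.71, 7.57, 9.24, 7.61],
    "West":     [8.27, 5.49, 7.78, 9.54, 6.61],
}
GENERAL_MEAN = 7.78

def unweighted_means(table):
    # Simple average of the sub-class means of each class.
    return {name: sum(vals) / len(vals) for name, vals in table.items()}

def adjust_to_general_mean(means, general_mean):
    # Add one constant to every mean so that their average equals general_mean.
    shift = general_mean - sum(means.values()) / len(means)
    return {name: m + shift for name, m in means.items()}

unadjusted = unweighted_means(sub_means)
adjusted = adjust_to_general_mean(unadjusted, GENERAL_MEAN)
for name in sub_means:
    print(name, round(unadjusted[name], 2), round(adjusted[name], 2))
```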

(2) Standardization of proportions by weighted means of sub-class means 

This is similar to Method 1. The weighted means of columns (or rows) 
of the table of sub-class means are taken, with weights roughly in proportion 
to the numbers in each row (or column) class in the whole sample. 

An example of this method is given in Example 6.8. It is not applicable 
in full to Table 5.23.b on account of unoccupied cells. 

It should be noted that somewhat different quantities are estimated by the 
two methods. Failure to recognize this fact sometimes causes a certain amount 
of confusion. If applied to a varieties × regions table, for example, Method 1 
estimates the differences between regions that would occur in the hypothetical 
situation in which equal numbers of fields of all varieties were grown in each 
region. Method 2 gives estimates appropriate to the hypothetical situation 
in which the different varieties are grown in the same proportions in all regions, 
these proportions being equal to the average proportions for the whole country. 
Only if the differences between the different varieties are the same for all 
regions will the estimated differences be the same. 

Method 2 has two advantages over Method 1. It is in general more 
accurate, since greater weight is on the average given to the cells containing 
the greater numbers of units. It also gives estimates which refer to a 
hypothetical situation more in conformity with that actually existing. 
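
The distinction between the two methods can be illustrated on the part of Table 5.23.b that has no unoccupied cells, namely the varieties Majestic and King Edward. This restricted calculation is not given in the text; it is a sketch, with weights taken in proportion to the numbers of fields of the two varieties in the whole sample (393 and 250), and with function names that are assumptions.

```python
regions = ["Scotland", "North", "E. Midlands", "South", "West"]

# Sub-class means (tons per acre) from Table 5.23.b.
sub_means = {
    "Majestic":    [9.46, 8.19, 8.42, 8.15, 8.27],
    "King Edward": [7.65, 5.71, 6.34, 5.87, 5.49],
}

def standardized_means(sub_means, weights, labels):
    """Weighted mean of the sub-class means of each column."""
    total_w = sum(weights.values())
    return {
        label: sum(weights[v] * sub_means[v][j] for v in sub_means) / total_w
        for j, label in enumerate(labels)
    }

# Method 2: weights proportional to the numbers in the whole sample.
method2 = standardized_means(sub_means, {"Majestic": 393, "King Edward": 250},
                             regions)
# Method 1 is the special case of equal weights.
method1 = standardized_means(sub_means, {"Majestic": 1, "King Edward": 1},
                             regions)

for r in regions:
    print(r, round(method1[r], 2), round(method2[r], 2))
```

Since Majestic outyields King Edward in every region and receives the greater weight under Method 2, the Method 2 means are uniformly higher than those of Method 1, although the differences between regions are affected much less.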

TABLE 5.23.c POTATO SURVEY: UNWEIGHTED MEANS OF SUB-CLASS MEANS 
(a) OMITTING E. MIDLANDS AND SOUTHERN REGIONS, (b) OMITTING GREAT 
SCOT, ARRAN BANNER AND KERR'S PINK 

                   Unadjusted means       Means adjusted to sample mean 
                    (a)       (b)              (a)       (b) 

Scotland           8.75      8.56             8.55      8.98 
North              7.66      6.95             7.46      7.37 
E. Midlands          -       7.38               -       7.80 
South                -       7.01               -       7.43 
West               7.54      6.88             7.34      7.30 

Mean               7.98      7.36             7.78      7.78 

Majestic           8.64      8.50             8.44      8.92 
King Edward        6.28      6.21             6.08      6.63 
Great Scot         8.19        -              7.99        - 
Arran Banner       9.30        -              9.10        - 
Kerr's Pink        7.50        -              7.30        - 

Mean               7.98      7.36             7.78      7.78 


It will be recognized that the marginal means of the sub-class means, 
whether weighted or unweighted, do not contain the whole of the information. 
If variety P yields more than variety Q in one region and less in another, for 
example, this fact can only be established from the sub-class means. Under 
such circumstances any comparison of the regions based on equalization of 
the proportions of the varieties represents an over-simplification of the real 
situation. 

Neither Method 1 nor Method 2 can be applied to the whole of tables 
in which there are blank cells. Even if there are no blank cells neither method 
will be very satisfactory when there are certain cells which contain so few 
units that the corresponding cell means are very inaccurate. Thus in the 
present example the relatively large difference between Scotland and the 
Northern region in column (b) of Table 5.23.c, in contrast to column (a), 
is in part due to the fact that Arran Banner has given a larger yield in the 
Northern region than in Scotland. This may well be due to sampling errors, 
since there are only 8 fields of Arran Banner in Scotland. 

There are two further relatively simple methods which are of value in such 
circumstances. 

(3) Weighted means of differences of sub-class means 

If only two classes (cross-classified by a second classification into a number 
of sub-classes) are to be compared, a weighted mean of the differences of each 
pair of sub-classes can be taken. Maximum accuracy will be attained when 
the weights are inversely proportional to the squares of the standard errors 
of these differences. * 

With independent samples in each sub-class the square of the standard 
error of a difference is equal to the sum of the squares of the standard errors 
of the two means (Section 7.5). Under certain circumstances, which will 
be apparent from a study of Chapter 7, and in particular when the selection 
from within sub-classes is effectively random and the standard deviation per 
unit is constant, the standard errors are inversely proportional to the square 
roots of the numbers of units in the sub-classes (Section 7.1). In this case 
the reciprocals of the weights must be taken proportional to the sums of the 
reciprocals of the pairs of sub-class numbers, i.e. if $n_1$ and $n_2$ are a pair of 
sub-class numbers the weight can be taken equal to $w$, where 

$$\frac{1}{w} = \frac{1}{n_1} + \frac{1}{n_2}$$

The calculations are shown in Table 5.23.d. The weight for Majestic, for 
example, is given by 1/w = 1/37 + 1/75. Weights may be taken to the nearest 
whole number. They can be rapidly calculated from a table of reciprocals 
or on a slide rule. The weighted mean, + 1.08, is obtained by dividing the 
total of wz by the total of w. 

TABLE 5.23.d POTATO SURVEY: ESTIMATE OF DIFFERENCE OF SCOTTISH AND 
NORTHERN REGIONS FROM WEIGHTED MEAN OF VARIETAL DIFFERENCES 

                   Difference     Weight 
                       z             w             wz 

Majestic            + 1.27          25          + 31.75 
King Edward         + 1.94          10          + 19.40 
Great Scot          + 1.65           8          + 13.20 
Arran Banner        - 0.12           7          -  0.84 
Kerr's Pink         + 0.68          24          + 16.32 

Total                               74          + 79.83 

Weighted Mean       + 1.08 
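
The calculation of Method 3 for the Scottish and Northern regions may be sketched as follows. The sub-class numbers and means are those of Table 5.23.b; the function names are assumptions. The sketch gives the result both with unrounded weights and with the whole-number weights used in the table.

```python
# For each variety: ((fields, mean) in Scotland, (fields, mean) in North).
pairs = {
    "Majestic":     ((37, 9.46), (75, 8.19)),
    "King Edward":  ((42, 7.65), (14, 5.71)),
    "Great Scot":   ((18, 9.22), (14, 7.57)),
    "Arran Banner": ((8, 9.12), (38, 9.24)),
    "Kerr's Pink":  ((69, 8.29), (36, 7.61)),
}

def weight(n1, n2):
    # 1/w = 1/n1 + 1/n2
    return 1.0 / (1.0 / n1 + 1.0 / n2)

def weighted_mean_difference(pairs, weights=None):
    """Weighted mean of the differences; weights default to the unrounded w."""
    num = den = 0.0
    for name, ((n1, m1), (n2, m2)) in pairs.items():
        w = weight(n1, n2) if weights is None else weights[name]
        num += w * (m1 - m2)
        den += w
    return num / den

unrounded = weighted_mean_difference(pairs)

# Whole-number weights as tabulated.
table_weights = {"Majestic": 25, "King Edward": 10, "Great Scot": 8,
                 "Arran Banner": 7, "Kerr's Pink": 24}
tabulated = weighted_mean_difference(pairs, table_weights)

print(round(unrounded, 2), round(tabulated, 2))
```

The two results agree to within about 0.01, which indicates that rounding the weights to whole numbers costs little.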

* See Sections 9.6 and 9.7 for further examples of this procedure. 

(4) Pooling of classes 

In cases in which inspection or use of the previous methods indicates that 
the differences between certain of the classes are small, such classes can be 
pooled. This often eliminates blanks and very small numbers in the sub-class 
table. 

In the present example inspection indicates that there is little difference 
between the four English regions. This is confirmed by the means in 
Table 5.23.c. These regions may therefore be pooled. This pooling will 
permit a better estimate of the differences between the last three varieties than 
that given by Table 5.23.c. After pooling, Scotland can be included at 
¼ weight (weights 4 and 1 in Table 5.23.e), following Method 2. 

The calculations are shown in Table 5.23.e. It will be noted that the 
pooling can be effected by adding the numbers of fields and totals of yields 
for the four English regions from Table 5.23.b. 

TABLE 5.23.e POTATO SURVEY: EFFECT OF POOLING ENGLISH REGIONS 

                   English regions (pooled)       Scotland:      Weighted 
                   No.      Total      Mean         Mean           mean 

Majestic           356      2,942      8.26         9.46           8.50 
King Edward        208      1,242      5.97         7.65           6.31 
Great Scot          38        295      7.76         9.22           8.05 
Arran Banner        76        693      9.12         9.12           9.12 
Kerr's Pink         49        360      7.35         8.29           7.54 

ALL                727      5,532      7.61         8.52           7.79 

Weight                                    4            1 
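
The pooling and subsequent weighting can be sketched as below. The numbers of fields and totals of yields are those of Table 5.23.b (with the King Edward total for the E. Midlands taken as 539, the value consistent with the marginal totals), unoccupied cells being entered as zero; the weights 4 and 1 are those of Table 5.23.e. The function name is an assumption.

```python
# Columns: Scotland, North, E. Midlands, South, West (0 = unoccupied cell).
numbers = {
    "Majestic":     [37, 75, 104, 101, 76],
    "King Edward":  [42, 14, 85, 66, 43],
    "Great Scot":   [18, 14, 0, 6, 18],
    "Arran Banner": [8, 38, 0, 9, 29],
    "Kerr's Pink":  [69, 36, 0, 0, 13],
}
totals = {
    "Majestic":     [350, 614, 876, 823, 629],
    "King Edward":  [321, 80, 539, 387, 236],
    "Great Scot":   [166, 106, 0, 49, 140],
    "Arran Banner": [73, 351, 0, 65, 277],
    "Kerr's Pink":  [572, 274, 0, 0, 86],
}
ENGLISH_WEIGHT, SCOTLAND_WEIGHT = 4, 1

def pooled_comparison(numbers, totals):
    """Pool the four English regions and combine with Scotland at 4 : 1."""
    rows = {}
    for variety in numbers:
        n_eng = sum(numbers[variety][1:])      # add numbers of fields
        t_eng = sum(totals[variety][1:])       # add totals of yields
        mean_eng = t_eng / n_eng
        mean_scot = totals[variety][0] / numbers[variety][0]
        weighted = ((ENGLISH_WEIGHT * mean_eng + SCOTLAND_WEIGHT * mean_scot)
                    / (ENGLISH_WEIGHT + SCOTLAND_WEIGHT))
        rows[variety] = (n_eng, t_eng, mean_eng, mean_scot, weighted)
    return rows

rows = pooled_comparison(numbers, totals)
for variety, (n, t, m_eng, m_scot, m_w) in rows.items():
    print(variety, n, t, round(m_eng, 2), round(m_scot, 2), round(m_w, 2))
```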





5.24 Method of fitting constants 

If there are more than two classes between which differences are required, 
Method 3 can be used to compare each pair separately. It will not, however, 
give a consistent set of estimates, i.e. the sum of the estimated differences 
between A and B and between B and C will not exactly equal that between 
A and C. This is a reflection of the fact that Method 3, though fully efficient 
if there are only two classes in the table, is not fully efficient (in the statistical 
sense) when there are more than two classes. In this case, in addition to direct 
comparisons between A and B, indirect comparisons of A with C and C with B, 
etc., can be made. When the sub-class frequencies are not proportionate these 
will contribute some additional information. To take an extreme case, if variety 
P occurs in regions A and C only, and variety Q in regions B and C only, 
comparisons between regions A and B with differences between varieties 
eliminated can only be made by indirect comparisons involving region C. 

Estimates of maximum accuracy, which, as might be expected, are also 
consistent, can be obtained by fitting constants by the method of least squares 
(Yates, 1934, A; Snedecor, 1934, A; Snedecor and Cox, 1935, A). Snedecor 
has drawn attention to the fact that if the numbers of units in the 
different sub-classes are nearly proportionate, Method 2 can be used without 
appreciable loss of information in place of the more laborious method of 
fitting constants. 

Stevens (1948, A) has given a simple arithmetical method of obtaining the 
values of the estimates derived by fitting constants. The procedure, which is 
one of successive approximation, is illustrated for the data of our example 
in Table 5.24. 



TABLE 5.24 POTATO SURVEY: ESTIMATION OF VARIETAL AND REGIONAL 
DIFFERENCES BY FITTING CONSTANTS 

                                                               Starting values            Final 
                                                               and corrections            values 
                        Sc.      N.     E.M.     S.      W.      (1)      (2)      (3)      (4)     (5) 

                901     174     177     189     182     179 

Maj.            393      37      75     104     101      76   + 0.60   + 0.10   + 0.02   + 0.72    8.50 
K.E.            250      42      14      85      66      43   - 1.53     0.00   + 0.02   - 1.51    6.27 
G.S.             56      18      14       -       6      18   + 0.45   - 0.08   - 0.03   + 0.34    8.12 
A.B.             84       8      38       -       9      29   + 1.34   + 0.16     0.00   + 1.50    9.28 
K.P.            118      69      36       -       -      13   + 0.12   - 0.39   - 0.08   - 0.35    7.43 

Starting        (1)  + 0.74  + 0.27  - 0.29  - 0.51  - 0.14 
values and      (2)  + 0.09  - 0.48  + 0.36  + 0.14  - 0.16 
corrections     (3)  + 0.83  - 0.21  + 0.07  - 0.37  - 0.30 
                (4)  + 0.13  + 0.01  - 0.06  - 0.06  - 0.03 
                (5)  + 0.03  + 0.01  - 0.02  - 0.02    0.00 

Final           (6)  + 0.99  - 0.19  - 0.01  - 0.45  - 0.33 
values          (7)    8.77    7.59    7.77    7.33    7.45 



The number of fields in each sub-class, reproduced from Table 5.23.b, 
is shown in the body of the table, with marginal totals above and to the left. 
Column 1 and row 1 give the deviations of the varietal and regional mean 
yields from the general mean 7.78. These are obtained from Table 5.23.a 
or 5.23.b. Thus, for Majestic, 8.38 - 7.78 = + 0.60. These data are all 
that are necessary for the estimation process. 

The approximation should be started with the set of deviations which 
show the biggest differences, in this case the varietal deviations. We first 
calculate what the regional deviations would be with these varietal deviations 
if there were no regional differences. These regional deviations are shown 
with signs reversed in row 2. To obtain them the numbers of fields in each 
column are multiplied in turn by the deviations in column 1, summed and 
divided by the total number of fields in each column. Thus for Scotland 
the deviation is 

$$(+\,0.60 \times 37 - 1.53 \times 42 + 0.45 \times 18 + 1.34 \times 8 + 0.12 \times 69)/174 = -\,14.96/174 = -\,0.09$$

and similarly for the other columns. It is best to record the sums of products, 
- 14.96, etc., before division, as this provides a check against minor errors, 
the total of these sums of products being equal to the sum of the products of 
column 1 and the total column of numbers of fields. This sum, + 5.22, 
differs from zero only because of rounding-off errors. 
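
The first step of the approximation, together with its check, can be verified by the sketch below. The numbers of fields and the varietal deviations of column 1 are those of Table 5.24 (unoccupied cells entered as zero); the variable names are assumptions.

```python
# Rows: Majestic, King Edward, Great Scot, Arran Banner, Kerr's Pink.
# Columns: Scotland, North, E. Midlands, South, West.
counts = [
    [37, 75, 104, 101, 76],
    [42, 14, 85, 66, 43],
    [18, 14, 0, 6, 18],
    [8, 38, 0, 9, 29],
    [69, 36, 0, 0, 13],
]
varietal_dev = [0.60, -1.53, 0.45, 1.34, 0.12]   # column 1

n_regions = len(counts[0])

# Sum of products of each column of numbers with column 1.
sums_of_products = [
    sum(row[j] * d for row, d in zip(counts, varietal_dev))
    for j in range(n_regions)
]
region_totals = [sum(row[j] for row in counts) for j in range(n_regions)]

# Row 2: the expected regional deviations with signs reversed.
row2 = [-s / n for s, n in zip(sums_of_products, region_totals)]

# Check: the total of the sums of products equals the sum of products of
# column 1 with the marginal totals of numbers of fields.
check_columns = sum(sums_of_products)
check_margin = sum(sum(row) * d for row, d in zip(counts, varietal_dev))

print([round(s, 2) for s in sums_of_products])
print([round(r, 2) for r in row2])
print(round(check_columns, 2), round(check_margin, 2))
```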

Since the observed deviation for Scotland is + 0.74, and the expected 
deviation if there were no regional differences is - 0.09, a first estimate of the 
true regional deviation for Scotland with varietal differences eliminated is 
+ 0.74 - (- 0.09) = + 0.83. These estimates are shown in row 3, which is 
obtained by adding row 2 to row 1. 

The varietal deviations in column 1, however, are themselves affected 
by regional differences. These may now be corrected for by the same process, 
using the estimates of the regional deviations just obtained. To do this the 
values of row 3 are multiplied in turn by the numbers of fields in each row, 
summed and divided by the total number of fields in the row. Thus for 
Majestic we obtain, after reversal of sign, the correction + 0.10. These 
corrections are shown in column 2. The checks operate as before. 

The corrections in column 2 could now be added to the values of column 1 
to give second approximations to the varietal differences. It is simpler, however, 
to use the corrections themselves to calculate corrections to the estimated 
regional differences of row 3. These are shown in row 4. The same procedure 
of calculation is followed, and the signs are reversed as before. 

The values of row 4 are then used to give a second set of varietal corrections, 
which are shown in column 3, and these in turn are used to give a third set 
of regional corrections, which are shown in row 5. 

The process may be stopped at this point, since the corrections are now so 
small that the next set will be negligible. Columns 1, 2 and 3 and rows 3, 4 
and 5 are therefore summed to give the final estimates of the deviations, shown 
in column 4 and row 6. These deviations may then be added to the general 
mean 7.78 to give final estimates of the varietal means freed from regional 
differences (column 5), and of regional means freed from varietal differences 
(row 7). Final checks are obtained by forming the sums of products of these 
means with the total numbers of fields and dividing by the grand total 901. 
In each case the general mean 7.78 should be obtained. 
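
The whole process of successive approximation may be sketched as a short function. This is an illustrative reading of the steps described above, not a transcription of any published routine: the function name `fit_constants` and its arguments are assumptions, and intermediate values are carried unrounded, so that the results may differ from Table 5.24 by a unit or so in the second decimal place.

```python
def fit_constants(counts, row_dev, col_dev, sweeps=3):
    """Successive approximation for a two-way table with unequal numbers.

    counts  : numbers of units, counts[i][j] (0 for an unoccupied cell)
    row_dev : deviations of the row (varietal) means from the general mean
    col_dev : deviations of the column (regional) means from the general mean
    Returns the estimated row and column deviations after `sweeps` passes.
    """
    n_rows, n_cols = len(counts), len(counts[0])
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(row[j] for row in counts) for j in range(n_cols)]

    def col_correction(row_vals):
        # Expected column deviations from the given row values, sign reversed.
        return [-sum(counts[i][j] * row_vals[i] for i in range(n_rows))
                / col_tot[j] for j in range(n_cols)]

    def row_correction(col_vals):
        # Expected row deviations from the given column values, sign reversed.
        return [-sum(counts[i][j] * col_vals[j] for j in range(n_cols))
                / row_tot[i] for i in range(n_rows)]

    row_est = list(row_dev)                               # column 1
    col_step = [d + c for d, c in
                zip(col_dev, col_correction(row_dev))]    # row 3 = row 1 + row 2
    col_est = list(col_step)
    for _ in range(sweeps - 1):
        row_step = row_correction(col_step)               # columns 2, 3
        row_est = [a + b for a, b in zip(row_est, row_step)]
        col_step = col_correction(row_step)               # rows 4, 5
        col_est = [a + b for a, b in zip(col_est, col_step)]
    return row_est, col_est

# Data of Table 5.24 (rows: varieties; columns: Sc., N., E.M., S., W.).
counts = [
    [37, 75, 104, 101, 76],
    [42, 14, 85, 66, 43],
    [18, 14, 0, 6, 18],
    [8, 38, 0, 9, 29],
    [69, 36, 0, 0, 13],
]
varietal_dev = [0.60, -1.53, 0.45, 1.34, 0.12]    # column 1
regional_dev = [0.74, 0.27, -0.29, -0.51, -0.14]  # row 1

varietal, regional = fit_constants(counts, varietal_dev, regional_dev)
scot_minus_north = regional[0] - regional[1]
print([round(v, 2) for v in varietal])
print([round(r, 2) for r in regional])
print(round(scot_minus_north, 2))
```

Increasing `sweeps` carries the approximation further; with these data the additional corrections are negligible, as stated above.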

It will be seen that the final values do not differ greatly from the corresponding 
values of Tables 5.23.c, 5.23.d and 5.23.e, but they do differ substantially 
from the values of Table 5.23.a. The differences between Scotland and the 
Northern region, for example, are as follows: 

Regional means over all varieties (Table 5.23.a)              0.47 
Unweighted means of sub-class means (Table 5.23.c)    (a)     1.09 
                                                      (b)     1.61 
Weighted differences (Table 5.23.d)                           1.08 
Fitting constants (Table 5.24)                                1.18 

Equally the varietal means given in Table 5.23.e are very close to those of 
Table 5.24. 

This demonstrates that the more direct methods, used with judgment, 
are capable of giving satisfactory estimates. As is shown in Example 7.5, 
where the errors of these estimates are discussed, the third of the above regional 
differences, viz. 1.61, is decidedly less accurate than the others, since the 
information provided by the last three varieties is not taken into account. 
On the other hand, the agreement of the above estimates must not be taken 
as providing any indication of their real accuracy. Since all the estimates are 
based on the same data, any disagreement is primarily a reflection of the effects 
of the various approximations on the efficiency of the estimation process. 

There are some further general points in connection with the method of 
fitting constants which should be noted. 

The method will provide efficient estimates of the differences due to one 
classification, freed from the effects of the other classification, if the true 
differences are the same for all classes of the second classification. In the 
terminology of the design of experiments, this is equivalent to the non-existence 
of interactions. If the true differences vary markedly, the method is 
inappropriate. Instead the individual differences should be considered or 
Method 1 or Method 2 should be used. (See sub-section (2) above.) 

As mentioned above, if one of the classifications consists of two classes 
only, Method 3 is fully efficient for estimating the difference between these 
two classes. If estimates of the differences between the classes of the second 
classification are required, these may be derived by adjusting the class means 
in the manner followed in Table 5.24, assigning deviations of plus and minus 
half the difference to the two classes of the first classification. The method 
is in this case exact, and therefore no further approximations are required. 

The method can be extended to multiple classifications having three or 
more sets of classes. In this case the data required are the general means for 
each main classification together with the two-way tables of numbers of units 
corresponding to each pair of classifications. The numbers of units in the 
individual cells of the three- or more-way table are not required. If there 
are only two classifications the general means for each main classification, 
together with the numbers of units in the individual sub-classes, are required. 

The reporting of sub-class numbers of the relevant pairs of classifications 
should therefore be considered even in cases in which the reporting of the 
separate sub-class means or the calculation of adjusted estimates is not considered 
worth while. If these numbers are available the various inter-relations may 
then be studied subsequently without further reference to the original material. 

5.25 Preparation of reports 

When the analysis of a census or survey has been completed it is usually 
necessary to embody the results in a report. In addition to the presentation 
of the numerical results in the form of tables and graphs, some discussion and 
interpretation is also required. 

The lines to be followed in the preparation of such reports vary greatly 
according to the nature of the material and the purposes of the report, but 
are in general similar for sample censuses and surveys and for complete censuses 
and surveys. We therefore do not propose to discuss them here. 

There are, however, certain matters which should be reported on in a sample 
census or survey which do not arise (or are of less importance) in a complete 
census. These matters have been covered by the memorandum (already 
referred to in Section 1.6) prepared by the United Nations Sub-Commission 
on Statistical Sampling, entitled Recommendations concerning the Preparation 
of Reports on Sampling Surveys. The recommendations* are as follows : 

(1) General description of the survey 

The general description of the survey should include information on the 
following points. Some of these will require fuller treatment in the more 
detailed technical sections of the report. 

(a) Statement of purposes of the survey. A general indication should be 
given of the purposes of the survey and the ways in which it had been 
expected that the results would be utilized. 

(b) Description of the material covered. An exact description should be 
given of the geographical region and the categories of material covered 
by the survey. In a survey of a human population, for example, it is 
necessary to specify whether such categories as hotel residents, institutions 
(e.g. boarding houses, sanatoriums), vagrants, military personnel, were 
included. The reporter should guard against any possible misapprehension 
regarding the coverage of the survey. 

(c) Nature of the information collected. This should be reported in 
considerable detail, including a statement of items of information 
collected but not reported on. The inclusion of copies of the schedules 
and relevant parts of the instructions used in the survey (including 
special rules for coding and classifying) is often of value. If this is 
impracticable, it may be possible to make available a limited number of 
copies which may be obtained on request. 

* The introduction to the memorandum is omitted. 
(d) Method of collecting the data. Whether by interviewers, investigators, 
mail, etc. 

(e) Sampling method. An indication should be given in general terms of 
the type of sampling adopted, the size of the sample, the proportion 
it forms of the material covered, and arrangements for follow-ups, if 
any, in cases of non-response. 

(f) Accuracy. A general indication of the accuracy attained should be 
given. 

(g) Repetition. State whether the survey is an isolated one undertaken 
without intention of repetition, or is one of a series of similar surveys. 

(h) Point or period [of time]. Point or period of time to which the data 
refer. 

(i) Date and duration. The starting date and period taken for the field 
work. 

(j) Cost. An indication should be given of the cost of the survey, under 
such headings as preliminary work, field investigations, analysis, etc. 

(k) Responsibility. The name of the organization sponsoring the survey 
and of the one responsible for conducting it. 

(l) References. References should be given to any published reports or 
papers. 

(2) Design of the survey 

The [sampling] design of the survey should be carefully specified.* 

(3) Method of selecting sample-units† 

The reporter should describe the procedure used in selecting sample-units, 
and if it is not a random selection he should indicate the evidence on which 
he relies for adopting an alternative procedure. Purposive selection and quota 
sampling cannot be regarded as equivalent to random sampling. 

(4) Personnel and equipment 

It is desirable to give an account of the organization of the personnel 
employed in collecting, processing and tabulating the primary data, together 
with information regarding their previous training and experience. Arrangements 
for training, inspection, supervision, and methods of processing data 
should be explained, as also should methods of checking the accuracy both of 
the primary data and of the processing. A brief mention may be made of the 
equipment used in processing the data. 

* A section on terminology follows. 

† The first paragraph, defining random and systematic processes of selection, is 
omitted. 
The critical observations of the technicians in regard to any part of the 
survey should be given. These observations will help others to improve their 
operations. 

(5) Costs 

An important reason for the use of sampling (instead of complete 
enumeration) is lower cost. Information on costs is therefore of great interest. 
Costs should be classified so far as possible under such heads as preparation 
(showing separately the cost of pilot studies), field work, supervision, processing, 
analysis, and overhead costs. In addition, labour costs in man-weeks of 
different grades of staff, and also time required for interview and journey 
time and transport costs between interviews, should be given. The compilation 
of such information, although often inconvenient, is usually worth undertaking 
as it may suggest substantial economies in the planning of future surveys. 
Efficient design demands a knowledge of the various components of cost, as 
well as of the components of variance. 

(6) Accuracy of the survey 

(a) Precision as indicated by the random sampling errors deducible from 
the survey. Standard deviations of sample-units should be given in 
addition to such standard errors (of means, totals, etc.) as are of interest. 
The process of deducing these estimates of error should be made 
entirely clear. This process will depend intimately on the design 
of the sample survey. An analysis of the variances of the sampling-units 
into such components as appear to be of interest for the planning of 
future surveys is also of great value. 

(b) Degree of agreement observed between independent investigators 
covering the same material. Such comparison will be possible only 
when interpenetrating samples have been used, or checks have been 
imposed on part of the survey. It is only by these means that the 
survey can provide an objective test of possible personal equations 
(differential bias among the investigators). 

(c) Other non-sampling errors. (i) Errors which are common to all 
investigators, and indeed any constant component of error (or "bias") 
in the recorded information, will not be included in the estimates of 
the random sampling errors deducible from the survey results. 
(ii) Another source of error of the same type is that due to observation 
of quantities which do not correspond exactly to the quantities of which 
estimates are required: in a crop-cutting survey, for example, the 
yields of the sample plots give estimates of the amount of grain, etc., 
in the standing crop, whereas the final yield will be affected by losses 
at harvest. (iii) The possible effects of such errors on the accuracy 
of the results, and of incompleteness in the recorded information 
(e.g. non-response, lack of records, whether covering the whole of the 
survey or particular areas or categories of the material), should therefore 
be fully discussed. (iv) Any special checks instituted to control and 
determine the magnitude of these errors should be described, and the 
results reported. 

(d) Accuracy, completeness and adequacy of the frame. The accuracy of 
the frame can and should be checked and corrected automatically in 
the course of the enquiry, and such checks afford useful guidance for 
the future. Its completeness and adequacy cannot be judged by internal 
evidence alone. Thus complete omission of a geographic region or 
the complete or partial omission of any particular class of the material 
intended to be covered cannot be discovered by the enquiry itself, 
and auxiliary investigations have often to be made. These should be 
put on record, indicating the extent of inaccuracy which may be 
ascribable to such defects. 

(e) Comparison with other sources of information. Every reasonable 
effort should be made to provide outside comparisons with other 
sources of information. Such comparisons should be reported along 
with the other results, and the significant differences should be discussed. 
The object of this is not to throw light on the sampling error (since 
a well-designed survey provides adequate internal estimates of such 
errors) but rather to gain knowledge of biases and other non-random 
errors. 

(f) Efficiency. The results of a survey often provide information which 
enables investigations to be made on the efficiency of the sampling 
designs, in relation to other sampling designs which might have been 
used in the survey. The results of any such investigations should be 
reported. To be fully relevant the relative costs of the different 
sampling methods must be taken into account when assessing the 
relative efficiency of different designs and intensities of sampling. 
Such an investigation can be extended to consideration of the relation 
between the cost of carrying out surveys of different levels of accuracy 
and the losses resulting from errors in the estimates provided. This 
provides a basis for determining whether the survey was fully adequate 
for its purpose, or whether future surveys should be planned to give 
results of higher or lower accuracy. 
CHAPTER 6 

ESTIMATION OF POPULATION VALUES 

6.1 Possibility of alternative estimates 

In this chapter we shall deal with the derivation of estimates of the 
population values from the numerical results obtained in the sampling. A 
simple example of such an estimate is provided by the arithmetic mean of the 
sample values of a random sample. It is well known that this mean provides 
an estimate of the mean of the population from which the sample was drawn, 
though it will not, owing to sampling errors, be exactly equal to the mean 
of the population. 

The arithmetic mean of the sample values is not the only possible estimate 
of the population mean. We might, for instance, take the median, i.e. the 
central value, or the geometric mean, i.e. the antilogarithm of the mean of the 
logarithms of the sample values, or even the mean of the highest and lowest 
values in the sample. 
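As a brief illustration of how these alternative estimates can differ, the following sketch computes each of them for the twenty measurements given later in Table 6.4. It uses Python's standard `statistics` module; the variable names are ours and purely illustrative.

```python
import statistics

# The twenty sample measurements of Table 6.4.
sample = [
    6.2, 8.0, 8.2, 11.0,
    13.8, 12.0, 8.7, 10.3,
    8.0, 10.7, 8.5, 14.6,
    7.6, 9.1, 10.1, 8.0,
    10.3, 10.4, 9.3, 9.0,
]

arithmetic_mean = statistics.mean(sample)
median = statistics.median(sample)                    # central value
geometric_mean = statistics.geometric_mean(sample)    # antilog of the mean logarithm
mid_range = (min(sample) + max(sample)) / 2           # mean of highest and lowest

print(arithmetic_mean, median, geometric_mean, mid_range)
```

For these values the four estimates are all different, the mean of the highest and lowest values being the most strongly influenced by the single large measurement 14.6.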

In addition to estimates such as the mean and the median, which can be 
derived from a given set of values independently of any supplementary 
information associated with these values, there are further alternative estimates 
which can be derived by taking account of such supplementary information 
as is available, either qualitative or quantitative. Thus, as has been mentioned 
in Section 3.3, if the numbers of units from the whole population falling in 
the different strata of some stratification are known, a random sample can be 
adjusted so that the different strata are represented in their correct proportions. 
Similarly, supplementary information on a quantitative character can be used 
in various ways to provide estimates which will in general be more accurate 
than the simpler estimates which do not utilize this information. 

In deciding which is the best estimate for any given type of sampling three 
different criteria have to be considered. These are, absence of bias, accuracy 
(or, as it is technically known, efficiency), and computational convenience. 
In the case of a random sample, if the population values are normally 
distributed (the meaning of this term is explained in Chapter 7), the 
arithmetic mean will provide an estimate which is both free from bias and, 
apart from supplementary information, of maximum accuracy. It is also 
sufficiently simple computationally for practical use. More important, the 
mean will remain an unbiased estimate of the population mean whatever the 
form of the distribution of the population values, though it will not necessarily 
be the most accurate estimate that could be devised. The mean also has the 
incidental advantage that the sampling errors to which it is subject can be 
relatively easily assessed, and are not greatly dependent on the form of the 
distribution of the population values. 

In this book we do not propose to do more than list what appear to be 
the most useful estimates for any given type of sampling, and give examples 
of the computations involved. Any discussion of questions of bias and relative 
efficiency requires advanced mathematical statistical theory. In general the 
recommended estimates are free from any important source of bias ; where 
this is not the case the circumstances in which bias can arise are indicated. 
Estimates are required not only for the whole population, but frequently 
also for the different parts of it which constitute domains of study. The 
formulae of estimation which are applicable to the whole population are in 
general applicable also to the separate domains, and need not be discussed 
separately. In certain cases adjustments which can be applied to the population 
estimates cannot be applied to the estimates for the different domains. Thus 
if the population mean of a supplementary variate is known, but not the means 
for the different domains of study, adjustment by means of supplementary 
information can be applied only to the estimates of the whole population. In 
like manner the gain in accuracy due to stratification is sometimes different 
for the population and domain estimates. If the domains cut across the strata, 
for example, the errors of the domain estimates may only be slightly reduced 
by the stratification. 

6.2 Notation 

It is important, both in discussion of the problem of estimation and in 
the mathematical formulas, to make a clear distinction between estimates of 
the population values and the population values themselves. The present 
convention in mathematical statistics is to denote the population parameters 
by Greek letters and the corresponding estimates by the corresponding Latin 
letters. This convention, however, is difficult to apply consistently, and is 
in any case more appropriate for infinite hypothetical populations than for 
the finite populations met with in sampling. In the present manual we have 
for the most part adopted the convention of denoting the population values 
by bold type, the corresponding estimates of these values by Gill Sans type, 
and values appertaining to the selected sampling units by ordinary italic type. 
Thus, with a quantitative character or variate y, the values for the selected 
sampling units will be denoted by $y$ (with or without suffices as necessary), 
the mean of these values will be denoted by $\bar{y}$ (following the ordinary convention), 
the estimate of the mean of the population by $\bar{\mathsf{y}}$, and the true mean of the 
population by $\bar{\mathbf{y}}$. With a random sample we shall have $\bar{\mathsf{y}} = \bar{y}$, but $\bar{\mathsf{y}}$ differs 
from $\bar{\mathbf{y}}$ by the sampling error. Totals for the population are indicated by 
capitals, summation over the sample values by $S$, and summation over the 
different strata by $\Sigma$. 

In certain types of estimation we shall be concerned with the use of 
supplementary information, such as size of unit, which is known not only for 
the selected sampling units but also for the whole of the population, or in 
the case of two-phase sampling, for the larger number of units selected at 
the first phase. A variate representing quantitative supplementary information 
will be denoted by x. 
Even when information on a variate is not available for the whole population, 
it may be necessary to make an estimate of the population values for some 
standardized value of this variate. The letter x will also be used to denote 
a variate of this type. 

The following is a list of the principal symbols employed in this chapter : 

$b$, estimated regression coefficient. 
$f$, working sampling fraction. 
$\mathbf{f}$, exact sampling fraction. 
$g$, working raising factor ($= 1/f$). 
$\mathbf{g}$, exact raising factor ($= 1/\mathbf{f}$). 
$i$ (suffix), denotes values belonging to a particular stratum $i$. 
$n$, number of units in the sample. 
$\mathbf{N}$, $\mathsf{N}$, number of units in the population, and its estimate. 
$\mathbf{p}$, $\mathsf{p}$, proportion of units in the population possessing a given attribute, and its estimate. 
$r$, ratio $y/x$. 
$\mathbf{r}$, $\mathsf{r}$, ratio $\mathbf{Y}/\mathbf{X}$, and its estimate. 
$S$, summation over the units of the sample. 
$S_i$, summation within stratum $i$. 
$\Sigma$, summation over the strata. 
$u$, number of units in the sample possessing a given attribute. 
$\mathbf{U}$, $\mathsf{U}$, number of units in the population possessing a given attribute, and its estimate. 
$x$, supplementary quantitative variate, such as size of unit. 
$\mathbf{X}$, $\mathsf{X}$, total of $x$ for the population, and its estimate. 
$y$, quantitative variate under investigation. 
$\mathbf{Y}$, $\mathsf{Y}$, total of $y$ for the population, and its estimate. 
$\bar{r}$, $\bar{x}$, $\bar{y}$, means of $r$, $x$, $y$ for the sample. 
$\bar{\mathbf{x}}$, $\bar{\mathbf{y}}$, $\bar{\mathsf{x}}$, $\bar{\mathsf{y}}$, means of $x$ and $y$ for the population, and their estimates. 

6.3 General rules 

There are certain fundamental rules of estimation which apply to all types 
of sampling. These are : 

Rule 1 The population total of a quantitative variate 

To estimate totals for the population multiply all sample values by 
their raising factors (equal to the reciprocals of the sampling fractions) 
and sum the raised results. 

Rule 2 Number of units in the population 

To estimate the number of sampling units in the population follow 
Rule 1, scoring each selected sampling unit as 1. 
Rule 3 The population mean of a quantitative variate 

Divide the estimated total of the variate for the population by the 
estimated number of units in the population. 

Rule 4 Proportion (or percentage) of units possessing a given qualitative character 

Proceed as for a quantitative variate, scoring all units possessing the 
given character as 1 and all others as 0. Divide the estimated total score 
by the estimated number of units in the population. 

Rule 5 Ratio of two quantitative variates 

Estimate the totals of the two quantitative variates for the population 
by Rule 1 and take the ratio of these totals. (Rules 3 and 4 are special 
cases of Rule 5.) 

In cases in which the probability of selection of all units is the same 
(uniform sampling fraction or, in the case of multi-stage sampling, uniform 
overall sampling fraction), the first four rules can be condensed into the simple 
general rule that means and proportions in the population are estimated by 
the corresponding means and proportions in the sample, and totals and 
numbers in the population are estimated by multiplying the corresponding 
totals and numbers by the common raising factor. 

The above rules cover most of the methods of estimation discussed in this 
manual except those involving regression, which cannot easily be summarised 
in simple rules. They give rise to the formulae of estimation set out in the 
following sections of this chapter. 
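The five rules translate directly into a few short functions. The sketch below is illustrative only: the function names and the four-unit sample, in which two units carry a raising factor of 10 and two a raising factor of 2, are hypothetical.

```python
def estimate_total(values, raising_factors):
    """Rule 1: raise each sample value by its own factor and sum."""
    return sum(g * v for g, v in zip(raising_factors, values))


def estimate_number(raising_factors):
    """Rule 2: follow Rule 1, scoring each selected unit as 1."""
    return estimate_total([1] * len(raising_factors), raising_factors)


def estimate_mean(values, raising_factors):
    """Rule 3: estimated total divided by estimated number of units."""
    return estimate_total(values, raising_factors) / estimate_number(raising_factors)


def estimate_proportion(has_attribute, raising_factors):
    """Rule 4: score units possessing the character as 1 and all others as 0."""
    scores = [1 if flag else 0 for flag in has_attribute]
    return estimate_mean(scores, raising_factors)


def estimate_ratio(y_values, x_values, raising_factors):
    """Rule 5: ratio of the two estimated totals."""
    return estimate_total(y_values, raising_factors) / estimate_total(
        x_values, raising_factors
    )


# Hypothetical sample of four units with unequal raising factors.
y = [3.0, 5.0, 40.0, 60.0]
x = [1.0, 2.0, 10.0, 20.0]
has_attribute = [True, False, True, True]
g = [10, 10, 2, 2]

total_y = estimate_total(y, g)                     # 30 + 50 + 80 + 120
number = estimate_number(g)                        # 10 + 10 + 2 + 2
mean_y = estimate_mean(y, g)
proportion = estimate_proportion(has_attribute, g)
ratio = estimate_ratio(y, x, g)
print(total_y, number, mean_y, proportion, ratio)
```

When all the raising factors are equal they cancel in Rules 3, 4 and 5, which is the condensed rule stated above for a uniform sampling fraction.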

6.4 Random sample 

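Number: 

$$\mathsf{N} = gn$$

$\mathsf{N}$ will be equal to $\mathbf{N}$ except for minor discrepancies due to the use of a 
working sampling fraction which does not give an integral number of sampling 
units. If $\mathbf{N}$ is known then the true sampling fraction $\mathbf{f}$ equals $n/\mathbf{N}$ and the 
true raising factor $\mathbf{g}$ equals $\mathbf{N}/n$. 

Mean of a quantitative variate: 

$$\bar{\mathsf{y}} = \bar{y} = S(y)/n \qquad (6.4.a)$$

Total of a quantitative variate: 

$$\mathsf{Y} = g\,S(y) = \mathsf{N}\bar{y}$$

or more accurately, if $\mathbf{N}$ is known, and differs from $\mathsf{N}$, 

$$\mathsf{Y}' = \mathbf{g}\,S(y) = \mathbf{N}\bar{y} \qquad (6.4.b)$$

Proportion possessing a given attribute: 

$$\mathsf{p} = u/n$$

Number possessing a given attribute: 

$$\mathsf{U} = gu = \mathsf{N}\mathsf{p}$$

or, more accurately, 

$$\mathsf{U}' = \mathbf{g}u = \mathbf{N}\mathsf{p} \qquad (6.4.c)$$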

The same formulae of estimation will hold for systematic samples from 
lists, etc. 

Example 6.4. a 

In a housing survey of a town a systematic sample from a list of all houses 
was taken with a sampling fraction of 1/50. 627 houses out of a total of 8491 
in the sample were classified as defective. What is the estimated number and 
percentage of defective houses in the town ? 

$$\text{Percentage defective} = 100\,\mathsf{p} = 100 \times \frac{627}{8491} = 7.38 \text{ per cent.}$$

$$\text{Total number defective} = \mathsf{U} = 50 \times 627 = 31{,}350$$

Example 6.4.b 

If the values in Table 6.4 are taken to represent measurements on a random 
sample of 20 objects, selected from a batch of such objects with a sampling 
fraction of 1/25, estimate the mean measurement of the batch, and the total 
of all the measurements of the batch. 

TABLE 6.4 SAMPLE OF 20 MEASUREMENTS 

 6.2    8.0    8.2   11.0 
13.8   12.0    8.7   10.3 
 8.0   10.7    8.5   14.6 
 7.6    9.1   10.1    8.0 
10.3   10.4    9.3    9.0 

$$\mathsf{N} = 25 \times 20 = 500$$
$$S(y) = 193.8$$
$$\bar{\mathsf{y}} = \tfrac{1}{20} \times 193.8 = 9.69$$
$$\mathsf{Y} = 25 \times 193.8 = 4845$$

If the number in the batch is known to be 507, a slightly more accurate 
estimate of the total is 

$$\mathsf{Y}' = 507 \times 9.69 = 4913$$
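The arithmetic of Examples 6.4.a and 6.4.b can be checked with a short sketch such as the following. All figures are those given in the two examples; the variable names are ours.

```python
# Example 6.4.a: systematic sample of houses, sampling fraction 1/50.
g_houses = 50            # raising factor
n_houses = 8491          # houses in the sample
u_defective = 627        # defective houses in the sample

p_defective = u_defective / n_houses
pct_defective = 100 * p_defective
U_defective = g_houses * u_defective

# Example 6.4.b: random sample of 20 measurements, sampling fraction 1/25.
measurements = [
    6.2, 8.0, 8.2, 11.0,
    13.8, 12.0, 8.7, 10.3,
    8.0, 10.7, 8.5, 14.6,
    7.6, 9.1, 10.1, 8.0,
    10.3, 10.4, 9.3, 9.0,
]
g_batch = 25
n = len(measurements)

N_est = g_batch * n              # estimated number in the batch
S_y = sum(measurements)          # sample total
y_bar = S_y / n                  # estimated mean (formula 6.4.a)
Y_est = g_batch * S_y            # estimated total, working raising factor

N_known = 507                    # number in the batch, when known
Y_adj = N_known * y_bar          # adjusted total (formula 6.4.b)

print(round(pct_defective, 2), U_defective)
print(N_est, round(y_bar, 2), round(Y_est), round(Y_adj))
```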
6.5 Stratified sample with uniform sampling fraction 

The formulae for a random sample hold, except that if the numbers in the 
different strata $\mathbf{N}_i$ are known, and differ from $\mathsf{N}_i$, the formula 6.4.b is 
replaced by 

$$\mathsf{Y}' = \Sigma(\mathbf{N}_i \bar{y}_i) \qquad (6.5.a)$$

and the formula 6.4.c by 

$$\mathsf{U}' = \Sigma(\mathbf{N}_i \mathsf{p}_i) \qquad (6.5.b)$$

with corresponding slight increases in accuracy in $\bar{\mathsf{y}}$ and $\mathsf{p}$, if they are derived 
from these estimates by division by $\mathbf{N}$. 

Example 6.5 

Table 6.5.a shows the wheat acreages of the stratified random sample of 
1 in 20 Hertfordshire farms described in Section 3.7. Estimate the total 
wheat acreage of the county and the mean acreage of wheat per farm (a) from 
the data of the sample alone, (b) given the total number of farms in each 
size-group. Estimate also the number of farms growing wheat. 

TABLE 6.5.a HERTFORDSHIRE FARMS, 1939: ACREAGES OF WHEAT IN A 
STRATIFIED RANDOM SAMPLE OF 1 IN 20 FARMS (STRATIFIED BY ACREAGES 
OF CROPS AND GRASS) 

Size-group        3         4          5          6 
Acres             21-50     51-150     151-300    301- 
No. of farms      18        26         20         13 

Wheat acreages of the farms growing wheat: 

Size-group 3:  8, 5, 9, 8, 5 
Size-group 4:  49, 19, 10, 14, 27, 4, 33, 4, 12, 30, 13, 16, 13, 28, 5, 27, 23, 10, 22, 24, 3 
Size-group 5:  20, 56, 24, 18, 30, 17, 59, 32, 17, 71, 70, 48, 80, 70, 62, 36 
Size-group 6:  72, 92, 69, 78, 51, 84, 102, 13, 92, 158, 62 

TOTAL             35        386        710        873 



Size-group 1 (1-5 acres) : 22 farms, no wheat. 

Size-group 2 (6-20 acres) : 26 farms, 7 acres of wheat on 1 farm. 
The results are summarized in Table 6.5.b. The estimate of the total 
area of wheat in the county from the sample is 

$$\mathsf{Y} = 20 \times 2011 = 40{,}220 \text{ acres}$$

The mean area of wheat per farm is 

$$\bar{\mathsf{y}} = 2011/125 = 16.1 \text{ acres}$$

The number of farms growing wheat is 

$$\mathsf{U} = 20 \times 54 = 1080$$

TABLE 6.5.b SUMMARY OF SAMPLE OF TABLE 6.5.a 

Size-group   No. of farms   Farms with wheat    Wheat acreage    No. of farms   Total wheat 
(acres)      in sample      in sample           in sample        in county      acreage 
                            No.   Proportion    Total    Mean 
1-5               22         ..     0.000          ..     0.0         435            .. 
6-20              26          1     0.038           7     0.3         519           160 
21-50             18          5     0.278          35     1.9         357           680 
51-150            26         21     0.808         386    14.8         519         7,680 
151-300           20         16     0.800         710    35.5         400        14,200 
301-              13         11     0.846         873    67.2         266        17,880 
ALL              125         54     0.432       2,011    16.1       2,496        40,600 



If the total number of farms in each size-group is known, the estimate of 
$\mathbf{Y}$ can be calculated with slightly more accuracy by using the size-group means, 
as shown in the last three columns. This gives an estimate $\mathsf{Y}'$ of the total 
area of wheat of 40,600 acres, and a mean area per farm $\bar{\mathsf{y}}'$ of 40,600/2496 = 16.3 
acres. The gain in accuracy is here quite trivial, since the variation within 
each stratum is large relative to the mean of that stratum. 

The number of farms growing wheat can be estimated similarly from the 
proportions in the size-groups, giving 



$$\mathsf{U}' = 0.000 \times 435 + 0.038 \times 519 + \ldots = 1083$$



Again the gain in accuracy is trivial. 
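The calculations of Example 6.5 may be sketched as follows, using the stratum figures of Table 6.5.b. The variable names are ours. The sketch works from the unrounded stratum means and proportions, so the adjusted total differs by an acre or two from the rounded 40,600 given above.

```python
# (size-group, n_i in sample, farms with wheat u_i,
#  wheat acreage in sample S_i(y), farms in county N_i), from Table 6.5.b
strata = [
    ("1-5", 22, 0, 0, 435),
    ("6-20", 26, 1, 7, 519),
    ("21-50", 18, 5, 35, 357),
    ("51-150", 26, 21, 386, 519),
    ("151-300", 20, 16, 710, 400),
    ("301-", 13, 11, 873, 266),
]
g = 20  # uniform raising factor (1 in 20 sample)

n = sum(s[1] for s in strata)
u = sum(s[2] for s in strata)
S_y = sum(s[3] for s in strata)
N_total = sum(s[4] for s in strata)

# (a) From the sample alone.
Y = g * S_y           # total wheat acreage
y_bar = S_y / n       # mean acreage per farm
U = g * u             # number of farms growing wheat

# (b) Given the numbers of farms in each size-group (formulae 6.5.a and 6.5.b).
Y_adj = sum(N_i * acres / n_i for _, n_i, _, acres, N_i in strata)
U_adj = sum(N_i * u_i / n_i for _, n_i, u_i, _, N_i in strata)
y_bar_adj = Y_adj / N_total

print(Y, round(y_bar, 1), U)
print(round(Y_adj), round(y_bar_adj, 1), round(U_adj))
```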




6.6 Random sample, stratified after selection 

The means of, or proportions in, the different strata must be calculated 
separately, and formulae 6.5.a and 6.5.b used, with division by $\mathbf{N}$ for estimates 
of $\bar{\mathsf{y}}$ and $\mathsf{p}$. 

Example 6.6 

Table 6.6.a shows the data, including acreages of crops and grass, for 
the random sample of 1 in 20 Hertfordshire farms described in Section 3.7. 
Estimate the total area of wheat and the number of farms growing wheat 
(a) directly from the sample, (b) by stratification by size, given the total numbers 
of farms in the size-groups of Table 6.5.b. 

TABLE 6.6.a HERTFORDSHIRE FARMS, 1939: ACREAGES OF CROPS AND GRASS 
(1ST COLUMN), AND OF WHEAT (2ND COLUMN), OF A RANDOM SAMPLE OF 
1 IN 20 FARMS (CLASSIFIED BY DISTRICTS AFTER SELECTION) 



District 1 
15 farms 


District 3 
40 farms 


District 4 
24 farms 


District 5 
4 farms 


District 6 
24 farms 


188 16 
60 


370 67 
26 


40 
28 


11 
6 


4 
312 102 


8 
87 14 


192 


369 58 


221 59 


543 80 


8 


6 


48 


212 45 


31 


822 265 


11 


44 


44 
79 33 


153 20 

287 44 


6 
34 


654 112 
3 


335 102 


4 
614 72 


14 


28 


316 75 


158 50 




192 20 


465 92 


14 


116 33 


4 




10 


197 


4 


4 


68 27 


District 1 


24 


163 


17 


409 102 


55 12 




2 


198 


2 


6 


4 


10 farms 


9 


78 


3 


115 


2 




3 


6 


7 


19 


192 24 


128 5 


2 


35 


6 


274 6 


4 


4 


120 24 


168 


335 82 


3 


491 24 


46 


58 




4 


144 


224 28 


181 20 


20 


1,935 141 


1 


3 


280 75 


17 


30 




4 


482 62 


90 


24 


197 6 




180 


156 28 


3 


10 


14 3 


District 2 


120 11 


302 71 


3 


36 


32 6 








6 


12 


2 


8 farms 




4,851 763 


4 


89 


285 29 








161 80 





138 


8 






246 60 


547 25* 


126 


294 29 
597 107 






4,034 837 




2,027 174 


8 












2 








GRAND TOTAL, 




200 65 








125 farms : 


15,114 2,301 


14 












262 58 












1,385 259 
TABLE 6.6.b HERTFORDSHIRE FARMS, 1939: ESTIMATION OF WHEAT ACREAGE 
FROM THE RANDOM SAMPLE OF 1 IN 20 FARMS (TABLE 6.6.a) STRATIFIED BY 
SIZE-GROUPS AFTER SELECTION 

Size-group   No. in    Farms with wheat    Acreage of wheat    No. of farms   Total for 
(acres)      sample    No.   Proportion    Total     Mean      in county      county 
1-5             25      ..       ..           ..      ..           435            .. 
6-20            26       1     0.038           3     0.1           519            60 
21-50           16       1     0.062           6     0.4           357           140 
51-150          17       8     0.471         159     9.4           519         4,880 
151-300         26      20     0.769         762    29.3           400        11,720 
301-            15      15     1.000       1,371    91.4           266        24,310 
ALL            125      45     0.360       2,301    18.4         2,496        41,100 



(a) Total area of wheat = 2301 × 20 = 46,020 acres. 
Number of farms growing wheat = 20 × 45 = 900. 

(b) Classifying the data by size-groups (crops and grass) the numbers and 
totals shown in Table 6.6.b are obtained. The mean wheat acreage 
is then calculated for each size-group, multiplied by the total number 
of farms in that size-group, and the products summed, giving an 
estimated total wheat acreage of 41,100. Similarly, using the proportion 
of farms with wheat instead of the mean acreage for each size-group, 
the estimated number of farms growing wheat is found to be 
0.038 × 519 + 0.062 × 357 + ... = 860. 

6.7 Stratified sample (variable sampling fraction) 

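$$\mathsf{N} = \Sigma(g_i n_i)$$
$$\mathsf{Y} = \Sigma\{g_i S_i(y)\}$$
$$\bar{\mathsf{y}} = \mathsf{Y}/\mathsf{N}$$
$$\mathsf{U} = \Sigma(g_i u_i)$$
$$\mathsf{p} = \mathsf{U}/\mathsf{N}$$

If the $\mathbf{N}_i$ are known, the alternative formulae 6.5.a and 6.5.b, with division 
by $\mathbf{N}$ for $\bar{\mathsf{y}}$ and $\mathsf{p}$, are slightly more accurate. 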
Example 6.7 

Table 6.7.a shows the Hertfordshire farm data for the stratified systematic 
sample with a variable sampling fraction described in Section 3.7. Estimate 
the total wheat acreage and the number of farms growing wheat. 

TABLE 6.7.a HERTFORDSHIRE FARMS, 1939: STRATIFIED SYSTEMATIC SAMPLE 
OF WHEAT ACREAGES, WITH A VARIABLE SAMPLING FRACTION (CLASSIFIED 
BY DISTRICTS) 



Size-group : 
Sampling 
fraction : 
No. in 
sample : 


1-5 

Nil 



6-20 
1/200 
3 


21-50 
1/60 
6 


51-150 

1/20 
26 


151-300 
1/10 
40 


301-500 

1/5 
43 


501- 

1/3 

17 


District 
I 











30 
6 


17 18 

28 


172 92 
56 


114 


2 







10 



40 


30 16 50 
55 62 


50 49 121 
63 72 100 
186 124 105 
104 


119 
107 
101 
160 


3 










25 5 
10 



77 
41 42 
24 25 
61 


67 22 5 
58 75 51 
78 94 126 
86 97 


195 
120 


4 







17 


28 24 
8 
5 


42 24 
54 60 75 
44 6 


88 65 58 
94 115 98 
121 80 92 

18 40 120 


268 260 
265 260 
112 155 

240 1(58 
209 


5 














22 31 32 

27 


66 142 26 





6 


"""~~ 










19 


38 
56 17 29 






72 


7 












14 


60 


16 





TOTAL 








27 


214 


1,163 


3,292 


2,925 



The calculations are shown in Table 6.7.b. They follow the same lines
as before, except that the sample total for each stratum must be raised
separately. Using the working sampling fractions we obtain estimates of
42,765 acres of wheat and 911 farms growing wheat.
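
The raising of each stratum total by its own factor can be sketched as follows. The figures are those of Table 6.7.b; the variable names are introduced here purely for illustration, and the 1-5 acre group is omitted because it was not sampled.

```python
# Stratified sample with a variable sampling fraction (Table 6.7.b).
# Each row: (size-group, raising factor g_i, farms in sample n_i,
#            sample farms with wheat u_i, sample wheat acreage S_i(y))
STRATA = [
    ("6-20", 200, 3, 0, 0),
    ("21-50", 60, 6, 2, 27),
    ("51-150", 20, 26, 12, 214),
    ("151-300", 10, 40, 30, 1163),
    ("301-500", 5, 43, 40, 3292),
    ("501-", 3, 17, 17, 2925),
]

# Each stratum total is multiplied by its own raising factor, then summed.
total_acreage = sum(g * acres for _grp, g, _n, _u, acres in STRATA)
farms_with_wheat = sum(g * u for _grp, g, _n, u, _acres in STRATA)
estimated_farms = sum(g * n for _grp, g, n, _u, _acres in STRATA)

print(total_acreage, farms_with_wheat, estimated_farms)
```

The third figure is the estimate of N given by S(g_i n_i); it covers only the sampled size-groups.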
TABLE 6.7.b SUMMARY OF SAMPLE OF TABLE 6.7.a

Size-group   No. in   No. with   Total     Raising   Raised totals       Mean acreage
(acres)      sample   wheat      acreage   factor    No.      Acreage    per farm

1-5             -        -          -         -        -          -          -
6-20            3        0          0       200        0          0          0
21-50           6        2         27        60      120      1,620        4.5
51-150         26       12        214        20      240      4,280        8.2
151-300        40       30      1,163        10      300     11,630       29.1
301-500        43       40      3,292         5      200     16,460       76.6
501-           17       17      2,925         3       51      8,775      172.1

Total         135                                    911     42,765









6.8 Use of supplementary information in estimation 

As already indicated, supplementary information on a quantitative character, 
the values of which are known for all the units of the population, can be used 
as the basis of stratification, or for the adjustment of an unstratified sample 
by stratification after selection. Alternatively, as mentioned in Section 2.8, 
such information can be used directly without stratification. Two methods, 
the ratio method and the regression method, are available. In either case 
only the total or mean of the supplementary variate for the whole population 
need be known (in addition to the values for the selected sampling units). 
The ratio method is simpler computationally, but the regression method is 
in certain circumstances more accurate. 

In the ratio method, the ratio Y/X in the population is estimated from
the sample, the estimated ratio being multiplied by the total X of x for the
population to give the estimated total Y of y for the population. The method
of estimation must be such that bias is avoided. As already explained in
Section 2.6, the appropriate estimate of the ratio for a random sample is
S(y)/S(x) or $\bar{y}/\bar{x}$. More generally, Rule 5 of Section 6.3 will give an
unbiased estimate, though in certain cases separate values of the ratio may
be estimated for the different strata as described in Sections 6.10 and 6.11.*

In the regression method the average change of y for unit change of x
(known as the regression coefficient) is estimated, and this coefficient is used to
adjust the sample results for any discrepancy between the mean size of unit
in the sample and in the population.

* An extension of the ratio method, the double ratio estimate, is described in Section
10.5.

The contrast between the ratio and regression methods is illustrated in
Fig. 6.8. The data plotted are those of Table 6.12. The dots represent
the x and y values of the sample points, the sample mean $(\bar{x}, \bar{y})$ being M.
Q' represents the known population mean $\bar{X}$ of the supplementary variate,
which differs from the sample mean $\bar{x}$ by QQ'. The line OMD through the origin and the
mean represents the ratio $\bar{y}/\bar{x}$ given by the sample, and the ordinate $P_1Q'$
of the point $P_1$ on this line, equal to $(\bar{y}/\bar{x})\bar{X}$, gives the adjusted estimate of
the population mean by the ratio method. The regression line AMB also
passes through the mean, and has a slope b equal to the regression coefficient.



FIG. 6.8 USE OF SUPPLEMENTARY INFORMATION: RATIO AND REGRESSION METHODS
(DATA OF TABLE 6.12)
[Horizontal axis: volumes, x, of corresponding stands (cu. ft. per acre), with the points Q' and Q marked.]

This line has the property that the sum of the squares of the vertical distances
of the sample points from it is a minimum. The adjusted estimate by the
regression method is given by the ordinate $P_2Q'$ of the point $P_2$, and equals
$\bar{y} + b(\bar{X} - \bar{x})$, where $\bar{X}$ is the known population mean of x.

The regression method therefore differs from the ratio method in that in 
the former the straight line which best fits the sample values is taken, whereas 
in the latter the line through the origin is taken. When the supplementary 
variate x represents size of unit, the true regression line generally passes
through the origin, though curvature of this line may result in the best-fitting
straight regression line not passing through, or even very close to, the origin. 
Nevertheless, in most census work in which x represents size, or some variate 
closely correlated with size, the greater simplicity of the ratio method outweighs 
any small gain in accuracy resulting from the use of regression. 

It may be noted that in large samples the regression line can be plotted 
by grouping the data according to the x values and plotting the means of y 
for the different groups. 

The formulas for regression have been included in this book, not because 
it is expected they will be very commonly used in census work, but because 
the regression method represents an important part of sampling procedure, 
without which no account of sampling methods would be complete, and 
because the calculation of the sampling errors to which a balanced sample 
is subject can only be made by use of the regression concept. 

If the population mean of x is estimated from observations at the first
phase of two-phase sampling, the observations for y being obtained at the
second phase, the same formulae of estimation hold, the estimate $\bar{x}_1$ being
substituted for the population mean $\bar{X}$.* If, however, the sampling is single-phase,
and the estimate $\bar{x}$ from the sample is substituted for $\bar{X}$, the formula
$\bar{Y} = \bar{y}$ appropriate to a random sample without supplementary information
is obtained. In other words, there is no gain in accuracy in the population
estimate unless $\bar{X}$ is known or alternatively is estimated from a larger
sample than is available for y.
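
The reduction to the unadjusted sample mean can be verified in one line for each method. In the sketch below $\bar{x}$ and $\bar{y}$ denote the sample means and b the regression coefficient, and the sample mean $\bar{x}$ is put in place of the population mean of x:

```latex
\begin{align*}
\text{ratio method:} \quad & \frac{\bar{y}}{\bar{x}}\,\bar{x} = \bar{y}, \\
\text{regression method:} \quad & \bar{y} + b\,(\bar{x} - \bar{x}) = \bar{y}.
\end{align*}
```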

In addition to their use in the adjustment of the population estimate of the 
mean or total of y when the mean or total of x is known for the population, 
or is determined from the first phase of two-phase sampling, ratios and 
regressions are of use for the purpose of obtaining estimates of the means of 
y for some standardized value of x. Hence comparisons of different parts of 
the population can be made, freed from the effects of variation in the average 
values of x. In the case of ratios this is equivalent to comparing the values 
of ratios themselves : thus in an agricultural survey we may consider such 
quantities as number of sheep per 100 acres instead of number of sheep per 
farm. Regression enables similar standardization to be made in cases in which 
the ratio method is inappropriate ; in a nutrition survey, for example, it may 
well be found that the amount of malnutrition varies with size of family, but 
the relation will not be proportionate. In large-scale surveys, however, 
standardization of this type can equally well be made by using the size-group 
means, thus avoiding the trouble of calculating regression coefficients. 

The formulae in the following sections are given for a quantitative variate. 
Formulae for the proportion of units possessing a given attribute can be derived 
by scoring each unit as 1 if it has the attribute and zero otherwise. If the 
supplementary variate x represents size, the proportion of the attribute per
unit size may be required, in which case a unit of size x will be scored x if it
has the attribute, and zero otherwise.

* Note that the observations on x for the units for which y is also observed are part
of the first-phase information, and must be included when calculating this estimate.

Example 6.8

The data of Table 6.8 are extracted from the Report of the National Farm 
Survey of England and Wales (Ministry of Agriculture and Fisheries, 1946, G). 
They give the rents of holdings per acre of crops and grass, classified by size- 
groups, in Berkshire and Cornwall. Calculate rents per acre standardized for 
size of holding, in the proportions in which the different size-groups occur 
in the whole country. 

TABLE 6.8 RENTS PER ACRE FOR BERKSHIRE AND CORNWALL

Size-group   Rent per acre (shillings)   Proportionate areas of
(acres)      Berkshire     Cornwall      different size-groups
                                         in whole country

5-25            53            55                  1
25-100          32            31                  6
100-300         24            22                 10
300-700         20            18                  4
700-            17            14                  1

Overall         23            28                 22


The proportionate areas for the whole country are shown in the last column.
The standardized rent for Berkshire, using these areas as weights, is
(1 × 53 + 6 × 32 + ...)/22 = 26 shillings per acre, and that for Cornwall
is 25 shillings per acre.
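
The weighted means can be reproduced with a short sketch. The rents and weights are those of Table 6.8; the function name standardized_rent is introduced here only for the illustration.

```python
# Standardization by size-group, using the figures of Table 6.8.
WEIGHTS = [1, 6, 10, 4, 1]          # proportionate areas in whole country
BERKSHIRE = [53, 32, 24, 20, 17]    # rent per acre (shillings) by size-group
CORNWALL = [55, 31, 22, 18, 14]


def standardized_rent(rents, weights):
    """Weighted mean of the size-group rents, with the national areas as weights."""
    return sum(w * r for w, r in zip(weights, rents)) / sum(weights)


berkshire_std = standardized_rent(BERKSHIRE, WEIGHTS)
cornwall_std = standardized_rent(CORNWALL, WEIGHTS)
print(round(berkshire_std), round(cornwall_std))
```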

Inspection of the table shows that although the overall rent per acre is 
considerably less for Berkshire than for Cornwall, there is little difference 
between the two counties for the different size-groups, the rents for Berkshire 
being in general somewhat greater than those for Cornwall. This is brought 
out, in a single contrast, by the standardized rents. The lower overall rent 
per acre for Berkshire is in a certain sense accounted for by the greater 
proportion of large farms in that county. 

This example illustrates both the use and the danger of standardization of 
this type. The standardized rents eliminate the effects of differences in average 
size on the average rent, which in so far as they are due to greater concentration 
of buildings on the smaller holdings, greater demand for smaller holdings, 
etc., do not represent differences in value of the land. It would be incorrect, 
however, to assume that were the size-distributions of farms in the two counties 
the same the overall rents per acre would be the same. Part of the difference
is due to the tendency of poorer land to be farmed in larger units, and such
land would not command the full increase in rent which is apparently attracted 
by smaller farms if it were divided into smaller units. 

6.9 Ratio method : random sample 



$r = S(y)/S(x) = \bar{y}/\bar{x}$

$Y = rX$

$\bar{Y} = r\bar{X}$

The formula for $\bar{Y}$ may be used for obtaining the "standardized" value
$y_0$ of y for a standard value $x_0$ of x, or for estimation in two-phase sampling
using an estimate $\bar{x}_1$ obtained from first-phase information.

Example 6.9.a

Estimate the total area of wheat from the data of Example 6.6 by the
ratio method, given that the total area of crops and grass in the county is
273,074 acres.

Crops and grass in sample = S(x) = 15,114 acres
Wheat in sample = S(y) = 2301 acres

Estimate of wheat acreage in county = (2301/15,114) × 273,074
                                    = 0.15224 × 273,074
                                    = 41,570 acres.
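
As a minimal sketch, the same calculation in Python (the function name ratio_estimate_total is an illustrative choice; the figures are those of the example):

```python
def ratio_estimate_total(sample_y, sample_x, population_x):
    """Ratio method: r = S(y)/S(x), estimated total Y = r * X."""
    r = sample_y / sample_x
    return r, r * population_x


r, wheat_acres = ratio_estimate_total(
    sample_y=2301,         # wheat in sample, S(y)
    sample_x=15114,        # crops and grass in sample, S(x)
    population_x=273074,   # crops and grass in county, X
)
print(f"r = {r:.5f}, estimated wheat = {wheat_acres:,.0f} acres")
```

The unrounded product is slightly above 41,570, the difference arising only from rounding.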

Example 6.9.b

Table 6.9 gives the numbers of persons belonging to 43 kraals which form
a random sample of the 325 kraals in the Mondora Reserve in Southern
Rhodesia, and also the numbers of persons absent from these kraals. (The
data form part of the results of a sample census of the Hartley District, and
have kindly been made available by Dr. J. R. H. Shaul.) Estimate the
percentage of persons absent from the reserve, and the numbers of persons
belonging to the reserve and absent from the reserve.

TABLE 6.9 DATA FROM A RANDOM SAMPLE OF 43 KRAALS: TOTAL NUMBER
OF PERSONS (INCLUDING ABSENTEES), x, AND NUMBER OF ABSENTEES, y

   x    y       x    y       x    y       x    y

  95   18      89    7      75   12     159   36
  79   14      57    9      69   16      54   26
  30    6     132   26      63    9      69   27
  45    3      47    7      83   14      61    2
  28    5      43   17     124   25     164   69
 142   15     116   24      31    3     132   41
 125   18      65   16      96   45      82   10
  81    9     103   18      42   25      33    8
  43   12      52   16      85   35      86   22
  53    4      67   27      91   28      51   19
 148   31      64   12      73   13       -    -

Totals: S(x) = 3,427   S(y) = 799

Percentage absent = 100 r = (799/3427) × 100 = 23.3 per cent.

Number belonging to reserve = X = (325/43) × 3427 = 25,902.

Number absent from reserve = Y = (325/43) × 799 = 6039.

Number present in reserve = 25,902 - 6039 = 19,863.
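
The following sketch repeats the computation from the individual kraal figures of Table 6.9. The variable names are illustrative; the number of kraals in the reserve, 325, is as given in the example.

```python
# (x, y) = (persons belonging to kraal, persons absent), from Table 6.9.
KRAALS = [
    (95, 18), (89, 7), (75, 12), (159, 36),
    (79, 14), (57, 9), (69, 16), (54, 26),
    (30, 6), (132, 26), (63, 9), (69, 27),
    (45, 3), (47, 7), (83, 14), (61, 2),
    (28, 5), (43, 17), (124, 25), (164, 69),
    (142, 15), (116, 24), (31, 3), (132, 41),
    (125, 18), (65, 16), (96, 45), (82, 10),
    (81, 9), (103, 18), (42, 25), (33, 8),
    (43, 12), (52, 16), (85, 35), (86, 22),
    (53, 4), (67, 27), (91, 28), (51, 19),
    (148, 31), (64, 12), (73, 13),
]
KRAALS_IN_RESERVE = 325

n = len(KRAALS)
sum_x = sum(x for x, _ in KRAALS)
sum_y = sum(y for _, y in KRAALS)

percent_absent = 100 * sum_y / sum_x          # ratio of totals, not mean of ratios
raising_factor = KRAALS_IN_RESERVE / n        # 325/43
belonging = raising_factor * sum_x
absent = raising_factor * sum_y

print(n, sum_x, sum_y)
print(f"{percent_absent:.1f} per cent absent")
print(round(belonging), round(absent), round(belonging) - round(absent))
```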

This example, though superficially similar to Example 6.4.a, is structurally
different, in that the sampling units consist of kraals and not individuals. 
This affects the estimation of the sampling error (see Section 7.8). 

6.10 Ratio method : stratified sample with uniform sampling fraction 

(a) When the ratio is assumed to be the same for all strata : 
The formulae for a random sample hold. 

(b) When the ratio is permitted to assume different values for the different 
strata : 

Treat each stratum separately, using the formulae for a random sample,
and build up the population estimates by summation of the estimates for
the separate strata, with division by N (known or estimated) for the
population means.

This gives

$Y = S(r_i X_i)$, etc.

The choice between method (a) and method (b) depends on :

(1) Numbers in the different strata : method (b) can only be used if the
numbers of units from the individual strata are sufficiently large to
give reasonably accurate determinations of the values of the ratio for the
separate strata ; if the numbers are small and there is correlation
between r and x, the method will be biased. (This objection does not
hold if selection with probabilities proportional to x is used ; see
Section 3.10.)

(2) The degree of variation in the ratio between the different strata : the
greater this variation the greater will be the gain in accuracy by the
use of method (b) ; if the variation is small, method (a) may be the
more accurate.

(3) Computational convenience : method (a) is simpler, since only one
value of the ratio is involved.

Method (b) can also be used for a random sample stratified after selection, 
provided the population totals of x for each stratum are known. 

Example 6.10

Estimate the total area of wheat from the data of Example 6.6 by the 
ratio method, stratifying the data by districts, and using different values of 
the ratio for the different districts. 

TABLE 6.10 HERTFORDSHIRE FARMS, 1939: ESTIMATION OF WHEAT ACREAGES
FROM THE RANDOM SAMPLE OF 1 IN 20 FARMS BY THE RATIO METHOD AFTER
STRATIFICATION INTO DISTRICTS

District   Sample                                          District       Estimated
No.        No.    Wheat     Crops and grass    Ratio       crops and      district wheat
                  S_i(y)    S_i(x)             r_i         grass X_i      r_i X_i

1           15      141        1,935           .0729         22,932          1,670
2            8      259        1,385           .1870         43,591          8,150
3           40      763        4,851           .1573         57,263          9,010
4           24      837        4,034           .2075         73,946         15,340
5 & 7       14      127          882           .1440         40,905          5,890
6           24      174        2,027           .0858         34,437          2,950

Total      125    2,301       15,114                        273,074         43,010

The computations are shown in Table 6.10. The neighbouring districts
of St. Albans (5) and Watford (7) (each of which contains rather a small
number of farms) have been combined.
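
A sketch of method (b) on the figures of Table 6.10 follows; the names DISTRICTS and separate_ratio_total are introduced only for this illustration. Using unrounded ratios the total comes to about 43,020 instead of the 43,010 obtained from the rounded entries of the table.

```python
# Ratio method with a separate ratio for each district (Table 6.10).
# Each row: (district, sample wheat S_i(y), sample crops and grass S_i(x),
#            district crops and grass X_i)
DISTRICTS = [
    ("1", 141, 1935, 22932),
    ("2", 259, 1385, 43591),
    ("3", 763, 4851, 57263),
    ("4", 837, 4034, 73946),
    ("5 & 7", 127, 882, 40905),
    ("6", 174, 2027, 34437),
]


def separate_ratio_total(rows):
    """Sum of r_i * X_i over strata, with r_i = S_i(y) / S_i(x)."""
    return sum((sy / sx) * big_x for _name, sy, sx, big_x in rows)


separate_total = separate_ratio_total(DISTRICTS)

# For comparison: method (a), a single ratio for the whole sample.
sum_y = sum(sy for _n, sy, _sx, _X in DISTRICTS)
sum_x = sum(sx for _n, _sy, sx, _X in DISTRICTS)
total_x = sum(big_x for _n, _sy, _sx, big_x in DISTRICTS)
combined_total = (sum_y / sum_x) * total_x

print(f"separate ratios: {separate_total:,.0f}")
print(f"single ratio:    {combined_total:,.0f}")
```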

6.11 Ratio method : stratified sample with variable sampling fraction

(a) Ratio the same for all strata :

$r = \frac{S\{g_i S_i(y)\}}{S\{g_i S_i(x)\}}$

$Y = rX$

(b) Ratio different for different strata :

Proceed in the same manner as for a fixed sampling fraction. The sampling
fraction does not enter into the calculations.

6.12 Regression method : random sample 
The equation of the regression line is

$y = \bar{y} + b(x - \bar{x})$

where

$b = \frac{S(x - \bar{x})(y - \bar{y})}{S(x - \bar{x})^2}$        (6.12.a)

The estimate of the population mean of y is

$\bar{Y} = \bar{y} + b(\bar{X} - \bar{x})$        (6.12.b)

where $\bar{X}$ is the population mean of x, and

$Y = N\bar{Y}$

If N is not known exactly, it must be estimated from the sample.

Note that if b is put equal to S(y)/S(x) formula 6.9 is obtained, and if
put equal to zero formula 6.4.a is obtained. All values of b will give unbiased
estimates, and consequently any value $b_0$ which appears appropriate to the
data under analysis may be used. Thus, taking $b_0 = 1$ is equivalent to the
use of the differences y - x. The regression method furnishes the value of b
which gives the most accurate estimate of $\bar{Y}$, using a formula of type 6.12.b,
at the cost of some additional computational labour.

Regressions may be used for standardization and in two-phase sampling 
in the same way as ratios. 

Example 6.12.a

Obtain an estimate of the total area of wheat from the data of Example 6.6,
using the regression method.
We have

$\bar{x} = 120.912$    $\bar{y} = 18.408$    N = 2496    $\bar{X} = 109.405$

$S(x^2) = 5,061,734$               $S(xy) = 902,958$                         $S(y^2) = 207,261$
$\bar{x} S(x) = 1,827,464$         $\bar{y} S(x) = \bar{x} S(y) = 278,219$   $\bar{y} S(y) = 42,357$
$S(x - \bar{x})^2 = 3,234,270$     $S(x - \bar{x})(y - \bar{y}) = 624,739$   $S(y - \bar{y})^2 = 164,904$

The method of calculation of the sums of squares and products is explained
in Section 7.1. The sum of squares of y will be required in the calculation
of the sampling error.

We then have

b = 624,739/3,234,270 = 0.19316
$\bar{Y}$ = 18.408 + 0.19316 (109.405 - 120.912) = 16.185
Y = 2496 × 16.185 = 40,400 acres
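
The steps of this example can be sketched from the raw sums alone. The quantities n, S(x), S(y), S(x²), S(xy), N and X are those quoted in Examples 6.6, 6.9.a and 6.12.a; the variable names are illustrative.

```python
# Regression estimate from sample sums (Example 6.12.a).
n = 125                 # farms in sample
sum_x = 15114           # S(x), crops and grass in sample
sum_y = 2301            # S(y), wheat in sample
sum_xx = 5061734        # S(x^2)
sum_xy = 902958         # S(xy)
N = 2496                # farms in county
X = 273074              # crops and grass in county

x_bar = sum_x / n
y_bar = sum_y / n
X_bar = X / N           # population mean of x

# Sums of squares and products about the sample means.
sxx = sum_xx - x_bar * sum_x
sxy = sum_xy - x_bar * sum_y

b = sxy / sxx                                  # formula 6.12.a
Y_bar = y_bar + b * (X_bar - x_bar)            # formula 6.12.b
Y = N * Y_bar

print(f"b = {b:.5f}, mean = {Y_bar:.3f}, total = {Y:,.0f} acres")
```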

Example 6.12.b

Table 6.12 gives the measured volumes of timber on 25 systematically
located plots of 1/10 acre, and eye estimates of the volumes per 1/10 acre in
the stands in which they occurred. If more than one sample plot occurs in
a stand this is indicated by a bracket, but the observations have been treated
as independent in the subsequent computations. The data, which refer to
conifer stands of uniform age and over 20 years of age in two counties, were
obtained in the course of the 1938-9 Census of Woodlands. They are plotted
in Figure 6.8. The total area of conifer stands over 20 years of age in these
two counties was 5124 acres, and the total volume of timber, from eye
estimates of all these stands, was 6,110,000 cu. ft., i.e. 1192 cu. ft. per acre.
Obtain unbiased estimates of the total volume of this class of timber from the
above data.

TABLE 6.12 MEASURED VOLUMES, y, ON 25 SAMPLE PLOTS, AND EYE ESTIMATES,
x, OF CORRESPONDING STANDS (CU. FT. PER 1/10 ACRE)

   y     x         y     x         y     x         y     x

 170   102       195   208 \     153    79       169   152 \
  47    14       255   208 /     216   177       182   152 /
  64    57       135   110 \     125    65        74   148
  91    70       146   110 |     100   196        24   207
 126    95       154   110 |     287   167       255   167
 146    92       110   110 /     261   268
  87   110       112   128

Totals: S(y) = 3,684    S(x) = 3,302
Means:  147.36          132.08

The mean of the measured volumes on the sample plots provides an unbiased
estimate of the volume per acre, and the estimate of the total volume, based
on the measurements of the sample plots only, is consequently

147.36 × 10 × 5124 = 7,551,000 cu. ft.

The ratio method gives

1192 × 5124 × 1473.6/1320.8 = 6,814,000 cu. ft.

Elimination of possible bias in the eye estimates by taking the difference
between the measured volumes and eye estimates on the sample plots
(equivalent to the use of the regression method with an arbitrary coefficient
b = 1) gives

5124 (1473.6 - 1320.8 + 1192) = 6,891,000 cu. ft.

The regression method proper requires the calculation of the regression
coefficient. We find

$S(y - \bar{y})^2 = 115,266$    $S(y - \bar{y})(x - \bar{x}) = 52,069$    $S(x - \bar{x})^2 = 82,296$

b = 52,069/82,296 = 0.63270

Y = 5124 {1473.6 + 0.63270 (1192 - 1320.8)} = 7,133,000 cu. ft.

The relative accuracy of these various methods of estimation is discussed
in Example 7.12.b.
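
The four estimates can be recomputed from the plot data of Table 6.12 with the sketch below. The variable names are illustrative; volumes are converted from the 1/10 acre plot basis to a per-acre basis before use.

```python
# (measured volume y, eye estimate x) per 1/10 acre, from Table 6.12.
PLOTS = [
    (170, 102), (47, 14), (64, 57), (91, 70), (126, 95), (146, 92), (87, 110),
    (195, 208), (255, 208), (135, 110), (146, 110), (154, 110), (110, 110),
    (112, 128), (153, 79), (216, 177), (125, 65), (100, 196), (287, 167),
    (261, 268), (169, 152), (182, 152), (74, 148), (24, 207), (255, 167),
]
AREA_ACRES = 5124        # total area of the stands
X_BAR_ACRE = 1192        # eye-estimated volume per acre over all stands

n = len(PLOTS)
y_bar = 10 * sum(y for y, _ in PLOTS) / n    # measured, cu. ft. per acre
x_bar = 10 * sum(x for _, x in PLOTS) / n    # eye estimate, cu. ft. per acre

# Regression coefficient of y on x (unchanged by the change of units).
my = sum(y for y, _ in PLOTS) / n
mx = sum(x for _, x in PLOTS) / n
sxy = sum((x - mx) * (y - my) for y, x in PLOTS)
sxx = sum((x - mx) ** 2 for _, x in PLOTS)
b = sxy / sxx

mean_only = AREA_ACRES * y_bar
ratio = AREA_ACRES * X_BAR_ACRE * y_bar / x_bar
difference = AREA_ACRES * (y_bar + 1.0 * (X_BAR_ACRE - x_bar))   # b taken as 1
regression = AREA_ACRES * (y_bar + b * (X_BAR_ACRE - x_bar))

for name, value in [("plots only", mean_only), ("ratio", ratio),
                    ("difference", difference), ("regression", regression)]:
    print(f"{name:11s} {value:12,.0f} cu. ft.")
```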

The above data are of course only a small part of the full data for the survey.
Examination of the whole of the data for the above two counties gave a value
of b of 0.55. The bias in the eye estimates, which are too low, though not very
large, is apparent in the above data. The average bias over the whole survey
was decidedly larger, and misleading results would have been obtained by
using the eye estimates without correction for bias from properly measured
and randomly located sample plots.

There is of course the possibility that the sample plots will themselves be
biased, if the location of the sample plots has not been objectively carried out,
or if the measurements have been carelessly made (e.g. by the inclusion of
trees whose centres do not lie within the demarcated sample area).
The sample plots used in this survey were somewhat small, and the use of 
larger plots, particularly in the case of hardwoods, possibly with second-stage 
sampling of trees for measurement, would have reduced the risk of bias of this 
nature. The surveyors were well trained, however, and thoroughly appreciated 
the need for objectivity, and on examination it appeared that serious bias from 
this cause could be ruled out. The results of a later survey of England and 
Wales confirmed the correctness of the earlier survey. 

6.13 Regression method : stratified sample with uniform sampling 
fraction 

(a) When the regression coefficient is assumed to be the same for all 
strata : 

The formulae for a random sample hold, except that formula 6.12.a is
replaced by

$b = \frac{S\{S_i(x - \bar{x}_i)(y - \bar{y}_i)\}}{S\{S_i(x - \bar{x}_i)^2\}}$

(b) When the regression is permitted to assume different values in the
different strata :

Proceed as in the ratio method.

6.14 Regression method : stratified sample with variable sampling 
fraction 

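(a) Regression the same for all strata :

$\bar{Y} = \bar{y}_w + b(\bar{X} - \bar{x}_w)$

where

$b = \frac{S\{\lambda_i S_i(x - \bar{x}_i)(y - \bar{y}_i)\}}{S\{\lambda_i S_i(x - \bar{x}_i)^2\}}$

the $\lambda_i$ being numerical weighting coefficients, and

$\bar{y}_w = \frac{S\{g_i S_i(y)\}}{S(g_i n_i)}$

with a similar expression for $\bar{x}_w$. $\bar{y}_w$ and $\bar{x}_w$ are the estimates of the
population means of y and x that would be obtained from the sample if there
were no supplementary information on x (see Section 6.7).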

If the regressions within strata are truly linear, with identical values of the
regression coefficient, then the most accurate estimate of b will be obtained
if the $\lambda_i$ are taken inversely proportional to the residual within-strata variances
of y about the regression lines. If the regression coefficients are different for
the different strata, then the component of error due to the assumption of
equality of regression coefficients will be minimized by taking $\lambda_i$ proportional
to $g_i^2$. Any set of $\lambda_i$ will give a virtually unbiased estimate of the population
mean of y, and detailed investigation of the theoretically best values to adopt
is seldom worth while. For most work $\lambda_i$ may be taken as unity if all the
strata contain material of similar variability, i.e. if the variable sampling
fraction arises from extraneous causes not connected with the variability of
the material, and equal to $g_i$ if the sampling fractions have been chosen so as
to minimize the sampling error. Under certain conditions $\lambda_i = g_i^2$ would be
best in this case, but under other conditions this would give excessive weight
to the strata with small sampling fractions.
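
The role of the weighting coefficients can be seen in a small sketch. Everything below is hypothetical: the two strata, their raising factors, the population mean of x and the function names are invented for illustration and are not taken from any survey in this book. The data are constructed so that the within-stratum slope is exactly 2 in both strata, in which case every choice of weights returns the same pooled coefficient.

```python
# Hypothetical two-stratum sample: (x values, y values) per stratum.
STRATA = [
    ([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]),   # y = 2x + 1
    ([10.0, 20.0], [25.0, 45.0]),         # y = 2x + 5
]
RAISING = [10.0, 2.0]    # hypothetical raising factors g_i
X_BAR_POP = 4.0          # hypothetical known population mean of x


def pooled_slope(strata, weights):
    """Weighted pooled within-strata regression coefficient of y on x."""
    num = den = 0.0
    for (xs, ys), lam in zip(strata, weights):
        mx = sum(xs) / len(xs)
        my = sum(ys) / len(ys)
        num += lam * sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den += lam * sum((x - mx) ** 2 for x in xs)
    return num / den


def raised_mean(values_by_stratum, raising):
    """Mean estimated with raising factors, ignoring supplementary information."""
    total = sum(g * sum(v) for v, g in zip(values_by_stratum, raising))
    count = sum(g * len(v) for v, g in zip(values_by_stratum, raising))
    return total / count


b_unit = pooled_slope(STRATA, [1.0, 1.0])    # weights taken as unity
b_g = pooled_slope(STRATA, RAISING)          # weights taken equal to g_i

x_w = raised_mean([xs for xs, _ in STRATA], RAISING)
y_w = raised_mean([ys for _, ys in STRATA], RAISING)
estimate = y_w + b_unit * (X_BAR_POP - x_w)

print(b_unit, b_g, round(estimate, 4))
```

With real data the within-stratum slopes differ, and the choice of weights then affects b in the manner discussed above.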

(b) Regression different for different strata : 
Proceed as in the ratio method. 

6.15 Use of regression to calibrate eye estimates 

It sometimes happens that eye estimates or similar subjective measurements, 
x, can be made on a properly selected and unbiased sample of the population, 
but that the objective measurements y, which are required to calibrate these 
estimates, can only be carried out on a non-random sub-sample of the original 
sample. The eye estimates cannot then be used as supplementary information 
in the manner of Example 6.12.b, since any bias in the sub-sample used for
the objective measurements would be reflected to a greater or less extent,
depending on the value of b, in the population estimate derived from the
regression.

In this case the regression of x on y, instead of y on x, must be calculated,
and the equation of estimation must be replaced by

$\bar{Y} = \bar{y} + \frac{1}{b'}(\bar{x}_1 - \bar{x})$,

where b' is the regression coefficient of x on y, $\bar{y}$ and $\bar{x}$ are the means for the
sub-sample, and $\bar{x}_1$ is the mean of the eye estimates for the original sample.
This procedure is subject to certain limitations. Firstly, the sub-sample, 
though non-random for the whole of the population, must be effectively 
random for units having any given value of y. If, for example, there is a 
tendency to select units which, for a given value of y, have high values of x 
serious bias may result. Thus, in a crop-estimation scheme, if eye estimates 
are made on a random sample of fields, and if reliance is placed on returns
by farmers of the actual yields of some of these fields, any tendency on the
part of the farmers to return only the yields of fields which have turned out 
better than their appearance would indicate will lead to an overestimate of the 
yield. On the other hand, the omission of a greater proportion of the low- 
yielding than of the high-yielding fields from the sub-sample will not bias the 
results, provided this omission is conditioned only by the final yield and not 
by the previous appearance or the value of the eye estimate. 

Secondly, for accuracy in the final estimate, the eye estimates must be 
reasonably accurate in the sense that variation about the regression line must 
be small, and the line itself must have an adequate slope. If the regression 
line is curved, this curvature can only be allowed for in the estimation formula 
if the variation about the regression line is negligible. Otherwise bias will be 
introduced. The use of the best fitting linear regression line, however, will 
avoid this source of bias. 

Example 6.15 

In order to test the accuracy of eye estimation as a method of estimating 
the yields of cereal crops shortly before harvest, a trial survey of the wheat 
crop of Hertfordshire was undertaken in 1940. Two observers were employed, 
one of whom visited 47 farms, observing 110 fields, and the other 16 farms, 
observing 37 fields. The whole set of farms constituted a systematic sample 
of 1 in 12 farms, excluding those growing less than 5 acres of wheat in 1939, 
a random sub-sample of fields being taken on the larger farms. The actual 
yields, as determined by the farmers, were subsequently obtained for as many 
of the observed fields as possible, and these were used to calibrate the eye 
estimates. The relation between the eye estimates, x, and the actual yields
per acre, y, for the first observer is shown in Fig. 6.15 for the 37 fields for
which yields were obtained. Obtain an estimate of the mean yield per acre
for the part of the county covered by this observer.

The regression coefficient, b', of x on y, calculated from the unweighted
values of x and y for the 37 fields, is 0.6926, the regression equation being

x = 30.00 + 0.6926 (y - 28.78)

This is shown by the full line in the figure. The dotted line represents the
line that would be obtained if there were no errors in the eye estimates. It
will be seen that there is a tendency to underestimate high yields and
overestimate low yields. The other observer and the farmers gave very similar
results.

The mean of the eye estimates $\bar{x}_1$ for the whole of the first observer's
sample, and the means of the eye estimates $\bar{x}$ and the yields $\bar{y}$ of the fields for which
yields were available, weighted according to the acreages of the fields, with
an additional raising factor if all the fields on the farm are not sampled, are,
in bushels per acre,

$\bar{x}_1 = 30.12$    $\bar{x} = 30.13$    $\bar{y} = 28.95$

Hence, since 1/0.6926 = 1.444, the final estimate of the yield per acre is

$\bar{Y}$ = 28.95 + 1.444 (30.12 - 30.13) = 28.94

The adjustment is here negligible, since $\bar{x}_1$ and $\bar{x}$ are almost identical.

FIG. 6.15 RELATION BETWEEN EYE ESTIMATES OF THE YIELDS AND THE ACTUAL YIELDS
OF 37 FIELDS OF WHEAT IN HERTFORDSHIRE
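
The calibration step of Example 6.15 amounts to a single line of arithmetic, sketched here with the figures of the example (the function name calibrated_mean is an illustrative choice):

```python
def calibrated_mean(y_sub, x_sub, x_full, b_x_on_y):
    """Adjust the sub-sample mean of y using the regression of x on y.

    y_sub, x_sub : means of y and x for the sub-sample with measured yields
    x_full       : mean of the eye estimates for the whole sample
    b_x_on_y     : regression coefficient of x on y
    """
    return y_sub + (x_full - x_sub) / b_x_on_y


yield_per_acre = calibrated_mean(y_sub=28.95, x_sub=30.13,
                                 x_full=30.12, b_x_on_y=0.6926)
print(f"{yield_per_acre:.2f} bushels per acre")
```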

6.16 Sampling with probabilities proportional to size of unit 
(a) Size, x, of all units of the population known, or X known :

In this case x acts as a supplementary variate, and the ratio method will
in general be appropriate. Since the probability of selection is proportional
to x, raising factors proportional to 1/x must be introduced into the formulae
already given. This leads to the formulae

$r = \frac{1}{n} S\left(\frac{y}{x}\right) = \bar{r}$

$Y = rX$

In other words the unbiased estimate of the population value of the ratio is
given by the arithmetic mean of the ratios from the selected sampling units.

(b) Total size X of the population not known :

In this case X, as well as Y and r, have to be estimated from the sample.
Selection has to be made by some such process as randomly or systematically
locating points on a map, and points not falling in the units under consideration
must be taken into account. If $n_0$ is the total number of sampling points,
of which n fall in the units under consideration, and A is the total area
covered by the sampling grid, we have

$r = \bar{r}$

$X = A n / n_0$

$Y = rX = \bar{r} A n / n_0$

Alternatively, if A is not known exactly the density d of points per unit area
may be used. We have

$A = n_0 / d$

$X = n / d$

If the sampling is two-phase, with $n_0'$ points (density d') at the first phase,
of which n' fall in the units under consideration, n, $n_0$ and d must be replaced
by n', $n_0'$ and d' in the above formulae.

Example 6.16.a

In a survey to estimate the area and yield of a crop, systematically located
points at a density of one per 4 square miles are taken, and the yields per acre
of the fields in which the points fall and which carry the crop are determined
by the harvesting of small areas. 8317 points in all are obtained, of which
529 fall in fields carrying the crop. The arithmetic mean of the yields per acre
of the selected fields is 15.7 cwt. per acre. Estimate the total area and yield
of the crop.

A density of 1 per 4 square miles is equal to 1/2560 per acre. Hence
area = X = 529 × 2560 = 1,354,000 acres
yield = Y = 15.7 × 1,354,000 cwt. = 1,063,000 tons

Example 6.16.b 

If, in addition to the yield data of Example 6.16.a, a further 24,938 points
were surveyed for type of crop only, giving an overall density, with the
8317 points of the yield survey, of one point per square mile, and 1673 of the
fields so located were found to carry the crop in question (in addition to the
529 fields above), obtain revised estimates of the total area and yield of the crop.

This is an example of two-phase sampling. The two sets of points together
constitute the first phase, for area of crop, and the 529 points for which yield
samples were taken constitute the second phase.
We therefore have

$n_0'$ = 24,938 + 8317 = 33,255

n' = 1673 + 529 = 2202

The density at the first phase is 1/640 per acre. Consequently
area = X = 2202 × 640 = 1,409,000 acres
yield = Y = 15.7 × 1,409,000 cwt. = 1,106,000 tons
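
Both point-sampling examples follow the same pattern, X = n/d and Y = rX, which the sketch below applies to the figures of Examples 6.16.a and 6.16.b. The function name and constants are illustrative; the conversions 640 acres per square mile and 20 cwt. per ton are the usual ones.

```python
ACRES_PER_SQ_MILE = 640
CWT_PER_TON = 20


def area_and_yield(points_in_crop, acres_per_point, mean_yield_cwt_per_acre):
    """Return (estimated crop area in acres, estimated yield in tons).

    acres_per_point is the reciprocal of the density d of sampling points.
    """
    area = points_in_crop * acres_per_point                  # X = n / d
    tons = mean_yield_cwt_per_acre * area / CWT_PER_TON      # Y = r X
    return area, tons


# Single phase: one point per 4 square miles, 529 points in the crop.
area_a, tons_a = area_and_yield(529, 4 * ACRES_PER_SQ_MILE, 15.7)

# Two-phase: area from all 1673 + 529 first-phase points at one per
# square mile; mean yield still from the 529 second-phase fields.
area_b, tons_b = area_and_yield(1673 + 529, ACRES_PER_SQ_MILE, 15.7)

print(f"{area_a:,} acres, {tons_a:,.0f} tons")
print(f"{area_b:,} acres, {tons_b:,.0f} tons")
```

The results differ slightly in the last figures from those in the text, which rounds the area to the nearest thousand acres before multiplying.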

6.17 Sampling from within strata with probabilities proportional to 
size of unit 

In this case the sizes of all units will be known. As pointed out in 
Section 3.10, if more than one unit is selected from some or all of the strata, 
the same unit not being selected twice, the probabilities will not in fact be 
exactly proportional to size, and slight bias will be introduced. The ratios 
from the selected units are meaned separately for each stratum, giving
equations of estimation

$r_i = \frac{1}{n_i} S_i\left(\frac{y}{x}\right)$

$Y = S(r_i X_i)$



Example 6.17 

Estimate the acreage of wheat and the number of farms growing wheat 
in Hertfordshire from the sample of parishes described in Section 3.11. 

TABLE 6.17.a  HERTFORDSHIRE WHEAT: SAMPLE OF 17 "COMBINED" PARISHES 
SELECTED FROM WITHIN DISTRICTS WITH PROBABILITY PROPORTIONAL TO SIZE 

District      Wheat      Crops and grass      Ratio 

    1           164           3,350            .049 

    2           766           3,040            .252 
                701           3,440            .204 
                503           2,040            .247 

    3           311           2,370            .131 
                228           3,330            .068 
                249           2,290            .109 
                686           2,930            .234 

    4           558           2,300            .243 
                775           4,430            .175 
                495           2,890            .171 
                565           2,420            .233 
                862           4,160            .207 

    5           818           3,470            .236 

    6           225           2,520            .089 
                738           3,740            .197 

    7           290           3,060            .095 





TABLE 6.17.b  ESTIMATION OF WHEAT ACREAGE FROM THE DATA OF TABLE 6.17.a 

               Mean ratio          Acreage of          Estimated 
District       wheat/crops       crops and grass        acreage 
                and grass          in district          of wheat 

    1             .049               22,932              1,120 
    2             .234               43,591             10,200 
    3             .136               57,263              7,790 
    4             .206               73,946             15,230 
    5             .236               24,964              5,890 
    6             .143               34,437              4,920 
    7             .095               15,941              1,510 

  TOTAL                             273,074             46,660 



The data from the sampled parishes are shown in Table 6.17.a, and the 
further computations for wheat acreage in Table 6.17.b. 

The computations for number of farms growing wheat follow exactly the 
same lines, using the ratio of number of farms growing wheat to acreage of 
crops and grass for each parish. These computations are left as an exercise 
to the reader. 

The use of the ratio of number of farms growing wheat to the total number 
of farms in the district would be equally admissible if the selection of parishes 
had been made with constant probability, but when the probability of selection 
is taken proportional to size of unit, the ratio to size, however defined, must 
be taken for all variates. 
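
The wheat-acreage computation of Example 6.17 can be sketched as follows. The figures are those of Tables 6.17.a and 6.17.b; the variable and function names are illustrative. The sketch works with unrounded ratios, so its total differs slightly from the 46,660 acres of the table, which is built from ratios rounded to three places.

```python
# (wheat acreage, acreage of crops and grass) for each sampled parish, by district.
SAMPLE = {
    1: [(164, 3350)],
    2: [(766, 3040), (701, 3440), (503, 2040)],
    3: [(311, 2370), (228, 3330), (249, 2290), (686, 2930)],
    4: [(558, 2300), (775, 4430), (495, 2890), (565, 2420), (862, 4160)],
    5: [(818, 3470)],
    6: [(225, 2520), (738, 3740)],
    7: [(290, 3060)],
}

# Total acreage of crops and grass in each district (the measure of size).
DISTRICT_SIZE = {1: 22932, 2: 43591, 3: 57263, 4: 73946,
                 5: 24964, 6: 34437, 7: 15941}


def mean_ratio(parishes):
    """Unweighted mean of the ratios y/x of the selected units in one stratum."""
    return sum(y / x for y, x in parishes) / len(parishes)


# District estimate = district size x mean ratio; the total is their sum.
estimates = {d: DISTRICT_SIZE[d] * mean_ratio(p) for d, p in SAMPLE.items()}
total_wheat = sum(estimates.values())

for d in sorted(estimates):
    print(f"District {d}: ratio {mean_ratio(SAMPLE[d]):.3f}, wheat {estimates[d]:,.0f} acres")
print(f"Total: {total_wheat:,.0f} acres")
```

The ratios are meaned without weights because selection with probability proportional to size has already given the larger parishes their greater chance of inclusion.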

6.18 Multi-stage sampling, no supplementary information 

In multi-stage sampling the process of estimation can be carried out stage 
by stage, using the appropriate methods of estimation at each stage. It is often 
more convenient, however, to combine all stages in a single process of 
estimation. 

Thus the combined or overall raising factor g for any sub-unit in two-stage 
sampling is given by the product of the first-stage raising factor g' of the 
main unit in which it occurs and the second-stage raising factor g'' of the 
particular sub-unit, i.e. 

$$g = g' g''$$

Hence the general formula for Y, when there is no supplementary information, 
is 

$$Y = S(gy)$$

where the summation is taken over all units, with similar formulae for N, etc. 

If the combined raising factors for a group of units are equal then the 
computations will be simplified by summing these units before multiplication. 
In particular, if all the combined raising factors are equal, the sample can be 
treated for purposes of estimation as if it were an ordinary random or stratified 
random sample with uniform sampling fraction. 
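
As a minimal sketch of the combined raising factor, the following computes the raised total in two ways: unit by unit, and after first summing the units that share a combined factor. The data and all names are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import defaultdict

# Hypothetical sub-units: (first-stage factor g', second-stage factor g'', value y).
SUB_UNITS = [
    (10, 4, 12.0),
    (10, 4, 7.5),
    (10, 2, 9.0),
    (20, 2, 3.0),
    (20, 1, 5.5),
]


def raised_total(units):
    """Y = S(g y) with g = g' g'', taken unit by unit."""
    return sum(g1 * g2 * y for g1, g2, y in units)


def raised_total_grouped(units):
    """The same total, summing units with equal combined factors before multiplying."""
    sums_by_factor = defaultdict(float)
    for g1, g2, y in units:
        sums_by_factor[g1 * g2] += y
    return sum(g * s for g, s in sums_by_factor.items())


print(raised_total(SUB_UNITS), raised_total_grouped(SUB_UNITS))
```

Here three units share the factor 40 and two the factor 20, so only two multiplications are needed in the grouped form.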

6.19 Multi-stage sampling with supplementary information 
(a) Ratio method, ratio the same for whole population : 

$$r = \frac{S(gy)}{S(gx)}, \qquad Y = rX,$$

etc., where g = g'g''. 

(b) Ratio method, ratio different for different parts of the population : 

Many variants are possible. All can be resolved by proceeding stage 
by stage by the methods already outlined. The danger of introducing 
bias if the number of units on which the ratios are based is small must be 
recognized. 

(c) Regression method : 

Regressions will usually be employed at the first stage of the sampling, in 
which case the regression coefficient or coefficients will be calculated in the 
manner appropriate to the type of sampling adopted at this stage, using the 
values of the totals of x and y for each main unit estimated from the second- 
stage sampling. 

If regression is used at the second stage the procedure for stratified samples 
can be used, treating the selected first-stage units as if they were strata. 

(d) Sampling with probability proportional to size : 

An important case is that in which the first-stage units are sampled from 
within strata with probability proportional to size, and the second-stage 
sampling fractions are chosen so as to give a uniform overall sampling fraction. 
In this case the use of the mean ratios $\bar{r}_i$ at the first stage of the estimation 
process (Section 6.17) will be found to be equivalent to the direct estimation 
from the second-stage units by means of the overall raising factors, i.e. Y = S(gy).* 

Example 6.19 

From the data of Table 6.19.a, obtained in the course of the Survey of 
Fertilizer Practice, estimate the average dressing of nitrogenous fertilizer on 
sugar beet in Norfolk. 

The two-stage sampling procedure of this survey has been described in 
Section 4.23. Information is lacking from a few farms, mainly owing to 
changes in tenancy. Since this affects the small farms to a greater extent 
than the large farms the adjusted sampling fractions shown in the table have 
been used. These are equal to the number of farms on which information 
is available divided by the total number of farms in the size-group. The 
second-stage sampling fractions are given by the reciprocals of the number 
of fields, since one field is selected on each farm. The combined raising factor 
for the sampled field of the first farm in Table 6.19.a is therefore 105, that for 
the sampled field of the third farm is 105 × 3 = 315, etc. 

* The estimation of error of this type of sampling is described in Section 10.9. 

TABLE 6.19.a  SURVEY OF FERTILIZER PRACTICE: DATA ON THE APPLICATION 
OF NITROGENOUS MANURES TO SUGAR BEET ON OLD ARABLE LAND IN NORFOLK 

        Small farms          |        Medium farms         |    Medium farms (contd.) 
No. of    Acreage     Cwt. N | No. of    Acreage     Cwt. N | No. of    Acreage     Cwt. N 
fields  Total  Sample    per | fields  Total  Sample    per | fields  Total  Sample    per 
                        acre |                         acre |                         acre 

     1      2       2    .68 |      2     13      10    .42 |      1     12      12    .16 
     1      6       6    .63 |      4     40       5    .63 |      1      6       6    .30 
     3      5       2    .55 |      1      8       8    .15 |      2     14       7    .42 
     2      4       3    .42 |      3     14       4    .42 |      3     21       6    .14 
     1      5       5    .36 |      2     51      11    .63 |      1      8       8    .42 
     1      3       3    .21 |      2     16      10    .90 |      2     10       2    .49 
     1      4       4    .21 |      3     19       7    .21 |      1      4       4    .30 
     1      2       2    .42 |      3     31      10    .84 |      1      4       4    .30 
     2      6       2    .42 |      3     39       4    .63 |      3     14       7    .45 
     2      8       4    .90 |      2     19      13    .52 |      6     42      10    .36 
     1      2       2    .52 |      1      9       9    .52 |      1      6       6    .21 
     2      4       3    .15 |      1     20      20    .42 |      1      4       4    .30 
     1      6       6    .70 |      5     26       7    .30 |      2     19       8    .42 
     2      5       4    .36 |      2      8       6    .54 |         Large farms 
     2      6       4    .56 |      4     20       8    .21 |      1      8       8      0 
     1      2       2      0 |      1      4       4    .42 |      1     48      48      0 
     1      4       4    .21 |      1     20      20    .72 |      3     19       4      0 
     1      5       5    .48 |      2      7       4    .36 |      3     56      24    .68 
     2      3       2    .10 |      4     32      11    .63 |      2     22       5    .36 
     1      7       7    .32 |      2     16       8    .82 |      4     30       5    .42 
     1      2       2    .80 |      2      6       3    .63 |      3     20       5    .90 
     1      4       4    .30 |      2     20      10    .56 |      6    126      29    .75 
                             |      1      7       7    .57 |      6     28       6    .63 



Number of farms without sugar beet on old arable land : small (6-50 acres), 8 ; 
medium (51-300 acres), 11; large (301- acres), 5. 

Sampling fractions (adjusted for absence of information) : small, 1/105 ; medium, 
1/59; large, 1/30. 

The average dressing of nitrogen must be obtained by calculating the 
raised total of the amount of nitrogen applied S(gy) and the raised total of 
the acreage sampled S (gx). The amount of nitrogen applied to a field is given 
by the product of the acreage and the rate per acre. The three size-groups 
are best kept separate in the computation. For the first size-group, therefore, 
applying the second-stage raising factors, we have 

S(g''y) = S(g''rx) = 1 × 2 × 0.68 + 1 × 6 × 0.63 + 3 × 2 × 0.55 + . . . 
                   = 1.36 + 3.78 + 3.30 + . . . 
S(g''x) = 1 × 2 + 1 × 6 + 3 × 2 + . . . 
        = 2 + 6 + 6 + . . . 

This gives the results shown in Table 6.19.b. Applying the first-stage raising 
factors to the total nitrogen and total acreage, we obtain the average dressing 
of nitrogen per acre : 

$$r = \frac{46.13 \times 105 + \ldots}{104 \times 105 + \ldots} = \frac{28{,}512.04}{58{,}229} = 0.490 \text{ cwt. per acre}$$

TABLE 6.19.b  ESTIMATION OF AVERAGE DRESSING FROM RAISED RESULTS 

                  Total          Total          Nitrogen         First-stage 
Size-group       nitrogen       acreage         per acre        raising factor 
                 S(g''y)        S(g''x)      S(g''y)/S(g''x)          g' 

Small              46.13          104             .444               105 
Medium            285.41          601             .475                59 
Large             227.64          395             .576                30 



The data presented comprise only a small part of the information collected, 
and the above method of estimation therefore demands a good deal of 
computation. For certain purposes comparative figures may be obtained 
from the straight averages of the dressings per acre on the sampled fields. 

TABLE 6.19.c  ESTIMATION OF AVERAGE DRESSING FROM UNWEIGHTED MEANS 

                                    Nitrogen per acre 
Size-group      No. of farms        Sum         Mean 

Small                22             9.30        .423 
Medium               36            16.32        .453 
Large                 9             3.74        .416 

All                  67            29.36        .438 



These averages are given in Table 6.19.c. The first-stage raising factors 
have not been used in this calculation, since the larger farms have more and 
larger fields in sugar beet, so that inequality in the omitted second-stage raising 
factors more than compensates for the difference in the first-stage raising 
factors. 

It will be noted that the mean dressings are less than those previously 
obtained for all size-groups, indicating the possibility that farms with little 
sugar beet, which are overweighted in the straight averages, are using less 
nitrogen per acre than those with a large amount of sugar beet. The data are 
too variable, however, to determine with certainty from this sample alone 
whether this is really a bias or is due to random sampling errors. The large 
difference for the large farms, for example, is due to farm 8 having a very 
large acreage of sugar beet. The relative accuracy of the two methods of 
estimation, apart from bias, is discussed in Example 7.17. 
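
The two methods of estimation in Example 6.19 can be compared with the following sketch, which uses the farm records of Table 6.19.a. The names are illustrative. Four dressings that are illegible in the table are entered as nil, and the dressing for the eighth large farm is read as 0.75; these readings are assumptions, chosen because they reproduce the published totals of Tables 6.19.b and 6.19.c.

```python
# Each farm: (number of beet fields, acreage of the sampled field, cwt. N per acre).
# Entries of 0.0 and the value 0.75 are assumed readings (see the note above).
SMALL = [(1, 2, .68), (1, 6, .63), (3, 2, .55), (2, 3, .42), (1, 5, .36), (1, 3, .21),
         (1, 4, .21), (1, 2, .42), (2, 2, .42), (2, 4, .90), (1, 2, .52), (2, 3, .15),
         (1, 6, .70), (2, 4, .36), (2, 4, .56), (1, 2, 0.0), (1, 4, .21), (1, 5, .48),
         (2, 2, .10), (1, 7, .32), (1, 2, .80), (1, 4, .30)]
MEDIUM = [(2, 10, .42), (4, 5, .63), (1, 8, .15), (3, 4, .42), (2, 11, .63), (2, 10, .90),
          (3, 7, .21), (3, 10, .84), (3, 4, .63), (2, 13, .52), (1, 9, .52), (1, 20, .42),
          (5, 7, .30), (2, 6, .54), (4, 8, .21), (1, 4, .42), (1, 20, .72), (2, 4, .36),
          (4, 11, .63), (2, 8, .82), (2, 3, .63), (2, 10, .56), (1, 7, .57), (1, 12, .16),
          (1, 6, .30), (2, 7, .42), (3, 6, .14), (1, 8, .42), (2, 2, .49), (1, 4, .30),
          (1, 4, .30), (3, 7, .45), (6, 10, .36), (1, 6, .21), (1, 4, .30), (2, 8, .42)]
LARGE = [(1, 8, 0.0), (1, 48, 0.0), (3, 4, 0.0), (3, 24, .68), (2, 5, .36), (4, 5, .42),
         (3, 5, .90), (6, 29, .75), (6, 6, .63)]

# Size-group -> (first-stage raising factor g', farm records).
GROUPS = {"small": (105, SMALL), "medium": (59, MEDIUM), "large": (30, LARGE)}


def second_stage_totals(farms):
    """Return S(g''y) and S(g''x), where g'' is the number of fields on the farm."""
    nitrogen = sum(fields * acres * rate for fields, acres, rate in farms)
    acreage = sum(fields * acres for fields, acres, _ in farms)
    return nitrogen, acreage


def raised_mean_dressing(groups):
    """First method: ratio of the fully raised nitrogen total to the raised acreage."""
    numerator = denominator = 0.0
    for first_stage_factor, farms in groups.values():
        nitrogen, acreage = second_stage_totals(farms)
        numerator += first_stage_factor * nitrogen
        denominator += first_stage_factor * acreage
    return numerator / denominator


def unweighted_mean_dressing(farms):
    """Second method: straight average of the dressings on the sampled fields."""
    return sum(rate for _, _, rate in farms) / len(farms)


print(f"raised estimate: {raised_mean_dressing(GROUPS):.3f} cwt. per acre")
for name, (_, farms) in GROUPS.items():
    print(f"{name}: unweighted mean {unweighted_mean_dressing(farms):.3f}")
```

The eighth large farm, with 6 fields and a sampled field of 29 acres, contributes 130.5 of the 227.64 units of raised nitrogen for its size-group, which is why the two methods differ most for the large farms.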

In the Survey of Fertilizer Practice the second method of estimation was 
used in investigation of secondary points, e.g. comparison of different types 
of farms. For the more important estimates, such as mean dressings per acre, 
a modification of the first method was used, the total acreages of sugar beet 
on the farms being taken as the raising factors for the second stage. This 
method of estimation is slightly more accurate than the first method given 
above, but will be biased if there is any tendency for farmers to apply heavier 
(or lighter) dressings to their large fields. There is no evidence that any 
appreciable bias does in fact arise from this cause, but even so it is perhaps 
doubtful whether there is much advantage in using this method of estimation 
rather than the unbiased method given above. The method would have been 
unbiased had selection of fields within a farm been made proportional to area, 
but this would have demanded somewhat more elaborate methods of selection 
in the field. 

6.20 Systematic and balanced samples 

The methods of estimation described in the preceding sections are also 
appropriate for systematic and balanced samples. Samples of these types 
without other restrictions, for instance, can be treated as if they were random 
samples for the estimation of the population values (but not for the sampling 
errors) ; if there is stratification the procedure for stratified random samples 
holds. An example of a systematic stratified sample with variable sampling 
fraction has already been given (Example 6.7). 

Certain estimation processes are naturally inappropriate to systematic and 
balanced samples. If the process of selection in a systematic sample is such, 
for example, that stratification is automatically introduced, there will be no 
gain from stratification after selection. Equally in a balanced sample the 
variate for which balance has been effected will be of no further value as 
supplementary information: the balance ensures that the corrections based 
on regression, or ratio, will be zero, whatever the value of the regression 
coefficient. If each stratum is balanced separately, then the corrections for 
the different strata will all be zero, even if the regression coefficient or ratio 
varies from stratum to stratum. 

In systematic samples of material which varies in a continuous manner, 
some gain in accuracy may result from the use of what are known as 
end-corrections. These corrections are made by assigning to the boundary 
observations weights which depend on their distance from the boundary. In 
systematic one-dimensional sampling of a line AB (Fig. 6.20), for example, with 
sampling points located at $P_1, P_2, \ldots, P_6$, if $Q_1, Q_2, \ldots, Q_5$ are the mid-points 
of $P_1P_2$, $P_2P_3$, etc., we may regard the observations at $P_2, P_3, P_4, P_5$ as 
estimates covering the lengths $Q_1Q_2$, $Q_2Q_3$, etc. The observations at $P_1$ 
and $P_6$ can similarly be regarded as estimates covering the lengths $AQ_1$ and 
$Q_5B$. Consequently the weights assigned to $P_1$ and $P_6$, relative to that assigned 
to $P_2$, $P_3$, etc., will be $AQ_1/Q_1Q_2$ and $Q_5B/Q_4Q_5$. 

FIG. 6.20  SYSTEMATIC SAMPLE, $P_1, P_2, \ldots, P_6$, OF THE LINE AB 
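
A minimal sketch of end-correction weights for a systematic sample along a line is given below. The positions are hypothetical, the function names are illustrative, and at least three sampling points are assumed.

```python
def end_corrected_weights(a, b, points):
    """Weight each point by the length it represents, relative to an interior point.

    The line runs from a to b. Interior points cover the stretch between the
    mid-points on either side; the two end points cover the stretch from the
    boundary to the nearest mid-point.
    """
    pts = sorted(points)
    midpoints = [(p + q) / 2 for p, q in zip(pts, pts[1:])]
    bounds = [a] + midpoints + [b]
    lengths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    interior_length = lengths[1]  # the length covered by the second point
    return [length / interior_length for length in lengths]


def weighted_mean(values, weights):
    """Mean of the observations using the end-corrected weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical line of length 60 sampled every 10 units, starting 3 units from A.
weights = end_corrected_weights(0, 60, [3, 13, 23, 33, 43, 53])
print(weights)
```

With the first point 3 units from A and the last 7 units from B, the end points receive relative weights 8/10 and 12/10, the interior points unit weight.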

The same principle, or an adaptation of it, might be applied in the case 
of two-dimensional systematic sampling of areas. End-corrections, however, 
are not likely to be of much value in the type of material usually dealt with 
in census and survey work, and we shall not discuss them further here, beyond 
mentioning that if the regions near the boundary differ from the remainder 
of the area, the use of end-corrections, instead of separate treatment of the 
boundary regions in the manner outlined in Section 3.14, will lead to biased 
estimates. 



6.21 Sampling on two successive occasions 

The most straightforward procedure for estimating the values of the 
population mean on two successive occasions is to treat each occasion 
separately, following whatever method of estimation is appropriate to the 
sample obtained on that occasion, regardless of the values obtained on the 
other occasion. Such estimates may be termed overall estimates. 

With independent samples on the two occasions, or with the same fixed 
sample on each occasion, the overall estimates will contain virtually all the 
available information, but where a sub-sample is taken on the second occasion, 
or there is partial replacement of the sample on the second occasion, the 
situation is more complicated. 

If the sample on the second occasion is confined to a sub-sample of the 
original sample, change will be most simply estimated from the differences 
of the units included in the sub-sample only. An estimate of the population 
mean or total on the second occasion is similarly obtained by adding the 
estimated change to the overall estimate on the first occasion. The most 
accurate estimate of the population mean on the second occasion will, however, 
be obtained by calculating the regression of the sample values for the second 
occasion on the corresponding values for the first occasion, and using the 
sample values for the first occasion as supplementary information. The 
procedure is exactly the same as has been already outlined for use of 
supplementary information by the regression method, the method appropriate 
to the type of sampling used being followed. The most accurate estimate of 
the change can then be obtained by taking the difference of this estimate of 
the mean from the overall estimate for the first occasion. 

The formulas for this procedure are as follows. Denote the sample values 
on the first occasion by x and those on the second occasion by y, the values 
belonging to units included in both samples by x', y', and those included on 
the first occasion only by x''. If a fraction λ of all the units included on the 
first occasion are taken on the second occasion, and a fraction μ equal to 1 − λ 
are omitted, then, for a random sample, 

$$\bar{x} = \lambda \bar{x}' + \mu \bar{x}''$$

$$\bar{y} = \bar{y}' + b(\bar{x} - \bar{x}')$$

where $\bar{x}$ is the overall estimate for the first occasion and $\bar{y}$ the adjusted estimate 
for the second occasion. The change is consequently estimated by 

$$\bar{y} - \bar{x} = \bar{y}' - \bar{x}' - \mu(1 - b)(\bar{x}'' - \bar{x}')$$

The calculation of the regression coefficient is based on the values of the units 
which are included on both occasions. 
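
The estimate of change follows in a few lines from the two equations of estimation. The steps are set out here as a check, writing $\bar{x}$, $\bar{x}'$, $\bar{x}''$ for the first-occasion means of all units, of the units retained, and of the units omitted, and $\bar{y}'$ for the second-occasion mean of the units retained.

```latex
\begin{align*}
\bar{y} - \bar{x} &= \bar{y}' + b(\bar{x} - \bar{x}') - \bar{x} \\
                  &= (\bar{y}' - \bar{x}') - (1 - b)(\bar{x} - \bar{x}'), \\
\bar{x} - \bar{x}' &= \lambda\bar{x}' + \mu\bar{x}'' - \bar{x}'
                    = \mu(\bar{x}'' - \bar{x}') \qquad (\lambda = 1 - \mu), \\
\bar{y} - \bar{x} &= \bar{y}' - \bar{x}' - \mu(1 - b)(\bar{x}'' - \bar{x}').
\end{align*}
```

When b = 1 the last term vanishes and the estimate reduces to the difference of the means of the units common to both occasions.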

If the changes of individual units are small compared with the differences 
between units, i.e. if the correlation between units on the two occasions is very 
close, as is likely to be the case when this type of sampling is adopted, b will 
be nearly equal to unity, and the estimate of change will differ little from that, 
$\bar{y}' - \bar{x}'$, derived from the units included on both occasions only. Equally 
the estimate of the population mean or total will differ little from that obtained 
by adding the estimate of change to the overall estimate on the first occasion. 

When a sample of the same size is taken on each occasion with partial 
replacement, a fraction μ of the units being replaced and a fraction λ being 
retained, the sample units which are retained can be used to furnish an 
estimate $\bar{y}_1$ of the population mean on the second occasion by the regression 
method already given. In addition there will be a further independent 
estimate $\bar{y}_2$, equal to $\bar{y}''$, derivable from the sampling units which are included 
on the second occasion only. The most accurate estimate $\bar{y}_w$ will be provided 
by a weighted mean of these two estimates. The correct weights are 
$\lambda/(1 - \mu^2 r^2)$ and $\mu(1 - \mu r^2)/(1 - \mu^2 r^2)$, where r is the correlation coefficient 
between the unit values on the first and second occasions. 

The correlation coefficient is calculated in the same manner as the regression 
coefficient b, using the values of the units common to both occasions, with 
the exception that instead of dividing by a quantity of the type $S(x - \bar{x})^2$ 
we divide by the corresponding quantity of the type $\sqrt{\{S(x - \bar{x})^2 \, S(y - \bar{y})^2\}}$. 
Thus, for a pair of random samples, 

$$r = \frac{S(x' - \bar{x}')(y' - \bar{y}')}{\sqrt{\{S(x' - \bar{x}')^2 \, S(y' - \bar{y}')^2\}}}$$

In the more complicated types of sampling the sums of squares and products 
are modified in the same manner as in the calculation of b. 

We thus have 

$$\bar{y}_w = \frac{\lambda\{\bar{y}' + b(\bar{x} - \bar{x}')\} + \mu(1 - \mu r^2)\,\bar{y}''}{1 - \mu^2 r^2}$$


If the numbers in the sample on the two occasions are not the same the 
above formula takes the modified form 

$$\bar{y}_w = \frac{n'\{\bar{y}' + b(\bar{x} - \bar{x}')\} + n''(1 - \mu r^2)\,\bar{y}''}{n' + n''(1 - \mu r^2)} \qquad (6.21.a)$$

where n' is the number of units re-sampled on the second occasion, n'' is the 
number of new units, and μ is the proportion of units sampled on the first 
occasion which are not re-sampled on the second occasion. 

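An estimate of the change can similarly be obtained by taking the weighted 
mean of the two estimates, $\bar{y}' - \bar{x}'$ and $\bar{y}'' - \bar{x}''$. The weights to be assigned 
to these estimates are $\lambda/(1 - \mu r)$ and $\mu(1 - r)/(1 - \mu r)$, so that 

$$\text{Change} = \frac{\lambda}{1 - \mu r}(\bar{y}' - \bar{x}') + \frac{\mu(1 - r)}{1 - \mu r}(\bar{y}'' - \bar{x}'') \qquad (6.21.b)$$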

This estimate of the change will differ from that given by the difference 
of $\bar{y}_w$ and the overall estimate on the first occasion. The reason for this is 
that once the sample for the second occasion has been taken, a more accurate 
estimate of the population mean on the first occasion is possible by using the 
information provided by the sample on the second occasion as supplementary 
information. If this revised estimate $\bar{x}_w$ is calculated, then the estimate of 
change given above will be very nearly equal to $\bar{y}_w - \bar{x}_w$. The slight discrepancy 
arises from the fact that unless the variances on the two occasions are equal 
the estimate of change given above is not quite the most accurate possible. 

It will be noted that when r is equal to 1, the above estimate of change is 
equal to $\bar{y}' - \bar{x}'$, whereas if r equals 0 the estimate is equal to the difference 
of the overall estimates of the population means. Similarly, if r equals 0, 
$\bar{y}_w$ equals the overall mean of y, and if the values of each unit are the same on 
both occasions (x' = y', r = 1, b = 1), $\bar{y}_w$ equals the mean of all the sample 
values included on each occasion, each value being included once only. 

A further practical point arises in connection with the estimation of b. If 
the variability on the two occasions is the same, both the regression coefficients, 
y on x and x on y, will be equal to the correlation coefficient. In most material 
for which sampling on successive occasions of the type under consideration 
is likely to be used, the variability on the different occasions may be expected 
to be very similar, and for such material it is best to replace b by r, as the 
latter is less subject to errors of estimation. 

Example 6.21 

The percentage solids-not-fat in two successive months for all the 16 cows 
in a herd which were in their 2-6 months of lactation in one, at least, of these 
months were observed to be : 

Cow            1      2      3      4      5      6      7      8 
November     8.82   8.94   9.86   8.90   9.00   9.13   8.90   9.02 
December       -      -      -      -    8.98   8.66   8.68   8.86 

Cow            9     10     11     12     13     14     15     16 
November     9.46   9.52   9.28   9.22     -      -      -      - 
December     9.30   9.50   9.13   9.32   9.38   8.78   9.10   9.04 

Estimate the change in percentage solids-not-fat between the two months, 
the mean percentage for December and the revised mean percentage for 
November. 

The sampling is here due to natural causes, and the number of cows in 
their 2-6 months of lactation will therefore not remain completely constant. 
The above two months were selected from more extensive records. 

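We find : 

$$S(x') = 73.53 \qquad S(y') = 72.43$$

$$S(x'') = 36.52 \qquad S(y'') = 36.30$$

$$\bar{x}' = 9.1912 \qquad \bar{y}' = 9.0538 \qquad \bar{y}' - \bar{x}' = -0.1374$$

$$\bar{x}'' = 9.13 \qquad \bar{y}'' = 9.075 \qquad \bar{y}'' - \bar{x}'' = -0.055$$

$$\bar{x} = 9.1708 \qquad \bar{y} = 9.0608 \qquad \bar{y} - \bar{x} = -0.11$$

$$S(x' - \bar{x}')^2 = 0.3435 \qquad S(x' - \bar{x}')(y' - \bar{y}') = 0.4076 \qquad S(y' - \bar{y}')^2 = 0.6742$$

$$r = 0.4076/\sqrt{(0.3435 \times 0.6742)} = 0.847$$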

It will be noted that the value of b, 0.4076/0.3435 = 1.187, is greater than 
unity, which illustrates 
the point made above that r provides a better estimate of the regression in 
material of this kind. The estimation of r should normally be based on more 
extensive data, though no very high accuracy is required: 100 pairs of values 
will be fully adequate. In the present case the more extensive data confirm 
that the above value of r is about correct. 

We also have λ = 2/3, μ = 1/3, and thus 

$$\frac{\lambda}{1 - \mu^2 r^2} = 0.724 \qquad \frac{\mu(1 - \mu r^2)}{1 - \mu^2 r^2} = 0.276$$

$$\frac{\lambda}{1 - \mu r} = 0.929 \qquad \frac{\mu(1 - r)}{1 - \mu r} = 0.071$$

$$\bar{y}_w = 0.724\{9.0538 + 0.847(9.1708 - 9.1912)\} + 0.276 \times 9.075 = 9.0471$$

$$\text{Change} = 0.929(-0.1374) + 0.071(-0.055) = -0.1315$$

The estimate of the mean for November, revised on the basis of the 
December values, is 

$$\bar{x}_w = 0.724\{9.1912 + 0.847(9.0608 - 9.0538)\} + 0.276 \times 9.13 = 9.1786$$

giving the check, 9.0471 − 9.1786 = −0.1315. Agreement is here exact, 
since the two regression coefficients, y on x and x on y, have both been taken 
equal to r. 
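
The computations of Example 6.21 can be reproduced with the sketch below. It assumes that cows 1-4 were observed in November only, cows 5-12 in both months, and cows 13-16 in December only; this grouping is the one that reproduces the sums S(x'), S(x''), S(y') and S(y'') given in the example. Names are illustrative. The sketch carries unrounded values throughout, so its results agree with the printed ones only to about the third decimal place.

```python
from math import sqrt

NOV_ONLY = [8.82, 8.94, 9.86, 8.90]                        # x'': cows 1-4
BOTH = [(9.00, 8.98), (9.13, 8.66), (8.90, 8.68), (9.02, 8.86),
        (9.46, 9.30), (9.52, 9.50), (9.28, 9.13), (9.22, 9.32)]   # (x', y'): cows 5-12
DEC_ONLY = [9.38, 8.78, 9.10, 9.04]                        # y'': cows 13-16


def mean(values):
    return sum(values) / len(values)


x1 = [x for x, _ in BOTH]
y1 = [y for _, y in BOTH]
mx1, my1 = mean(x1), mean(y1)                # means of the common cows
mx2, my2 = mean(NOV_ONLY), mean(DEC_ONLY)    # means of the unmatched cows
mx, my = mean(x1 + NOV_ONLY), mean(y1 + DEC_ONLY)   # overall means

sxx = sum((x - mx1) ** 2 for x in x1)
syy = sum((y - my1) ** 2 for y in y1)
sxy = sum((x - mx1) * (y - my1) for x, y in BOTH)
b = sxy / sxx                 # regression of y on x, not used further
r = sxy / sqrt(sxx * syy)     # correlation, used in place of b

lam = len(BOTH) / (len(BOTH) + len(NOV_ONLY))   # fraction retained
mu = 1 - lam                                    # fraction replaced

w = lam / (1 - mu ** 2 * r ** 2)   # weight of the regression estimate
y_w = w * (my1 + r * (mx - mx1)) + (1 - w) * my2   # December mean
x_w = w * (mx1 + r * (my - my1)) + (1 - w) * mx2   # revised November mean
change = (lam / (1 - mu * r)) * (my1 - mx1) \
    + (mu * (1 - r) / (1 - mu * r)) * (my2 - mx2)  # formula 6.21.b

print(f"b = {b:.3f}, r = {r:.3f}")
print(f"y_w = {y_w:.4f}, x_w = {x_w:.4f}, change = {change:.4f}")
```

Since both regression coefficients are taken equal to r, the change computed from formula 6.21.b equals the difference of the two adjusted means identically, not merely to rounding.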

6.22 Sampling on a number of successive occasions 

The formulas of estimation given in the last section cover all cases of 
sampling on two occasions only. When sampling is carried out with partial 
replacement on more than two occasions no such simple general solution is 
possible, but certain approximate solutions, which are very similar in form to 
those for sampling on two occasions, are likely to be sufficient for most practical 
purposes. 

In a sampling scheme which is repeated at intervals it is generally desirable 
to provide as accurate an estimate as possible of the population mean on each 
occasion without any revision of the estimates for previous occasions. Suppose 
that $\bar{y}_h$ is the most accurate estimate which can be obtained for occasion h, 
taking into account the results of the sampling up to and including this 
occasion h, and that $\bar{y}_{h-1}$ is the similar estimate for occasion h − 1, taking 
into account the results up to and including occasion h − 1. Subject to certain 
limitations, $\bar{y}_h$ and $\bar{y}_{h-1}$ are related by a formula of the type 

$$\bar{y}_h = (1 - \varphi)\{\bar{y}_h' + r(\bar{y}_{h-1} - [\bar{y}_{h-1}'])\} + \varphi\,\bar{y}_h'' \qquad (6.22.a)$$

where suffixes indicate the occasion, single dashes units common to occasions 
h and h − 1, the mean on the earlier occasion being distinguished by square 
brackets, and double dashes units occurring on occasion h only. 

The limitations are that a given fraction of the units is replaced on each 
occasion, that the variability on the different occasions and the correlation 
r between successive occasions are constant, and that the correlation between 
occasions two apart is $r^2$, that between units three apart is $r^3$, etc. This last 
condition is only necessary when units are included for more than two occasions, 
and no great loss of accuracy will occur under normal circumstances if it does 
not hold exactly. 

The value of φ depends on the value of r, on the fraction μ replaced on 
each occasion, and on the number of occasions h on which samples have already 
been taken. With increasing h, φ rapidly tends to a limiting value, which 
depends only on r and μ. This limiting value is 

$$\varphi = \frac{-(1 - r^2) + \sqrt{(1 - r^2)\{1 - r^2(1 - 4\lambda\mu)\}}}{2\lambda r^2}$$

where λ = 1 − μ. The values for h = 2 have been given in the previous section. For 
practical purposes the limiting value of φ may be used for all occasions after 
the second (the above formula for φ is due to Patterson (1950, A')). 

When the value of r has been determined, the values of φ can be calculated 
and formula 6.22.a used. 
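
The following sketch computes the limiting value of φ and illustrates how quickly it is approached. The step-by-step recursion is not taken from the text: it is an inverse-variance weighting argument under the limitations stated above (constant variability, constant fraction replaced, correlation r between successive occasions), and is offered as an assumption. All names are illustrative.

```python
from math import sqrt


def phi_limit(r, mu):
    """Limiting weight of the new units, from the formula above (lambda = 1 - mu)."""
    lam = 1 - mu
    r2 = r * r
    return (-(1 - r2) + sqrt((1 - r2) * (1 - r2 * (1 - 4 * lam * mu)))) / (2 * lam * r2)


def phi_sequence(r, mu, occasions):
    """Values of phi for occasions 2, 3, ..., `occasions`.

    g is the variance of the current estimate in units of sigma^2/n. The
    regression estimate from the retained units has variance
    (1 - r^2)/lambda + r^2 g, the mean of the new units has variance 1/mu,
    and the two are combined with weights inversely proportional to these.
    """
    lam = 1 - mu
    r2 = r * r
    g = 1.0  # first occasion: an ordinary sample mean
    values = []
    for _ in range(occasions - 1):
        regression_variance = (1 - r2) / lam + r2 * g
        total_weight = mu + 1 / regression_variance
        values.append(mu / total_weight)
        g = 1 / total_weight
    return values


r = sqrt(0.657)  # the value of r^2 used in Example 6.22
sequence = phi_sequence(r, 1 / 3, 30)
print([round(v, 3) for v in sequence[:5]], round(phi_limit(r, 1 / 3), 3))
```

For r² = 0.657 and μ = 1/3 the first value is the two-occasion weight μ(1 − μr²)/(1 − μ²r²) = 0.281, and the sequence settles near 0.252 within a few occasions.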

For most practical purposes $\bar{y}_h - \bar{y}_{h-1}$ will provide an adequate estimate 
of the change between occasions h − 1 and h. If change is of particular 
interest, however, formula 6.21.b may be used.* This latter estimate will 
of course not agree exactly with $\bar{y}_h - \bar{y}_{h-1}$ and will therefore lead to apparent 
inconsistencies in the summary of the results. 

It sometimes happens that the sampling scheme, though broadly following 
a partial replacement procedure, gives rise to some inequality of numbers of 
units on the different occasions. This can be allowed for by substituting for 
φ the value φ' given by 

$$\varphi' = \frac{n_h''}{\mu\, n_h}\,\varphi \qquad (6.22.b)$$

where $n_h$ is the number of units on occasion h, and $n_h''$ is the number of units 
not included on the previous occasion. 

TABLE 6.22  PERCENTAGE SOLIDS-NOT-FAT: ADJUSTMENT OF SAMPLES TAKEN 
ON SUCCESSIVE OCCASIONS 

(Where two figures are given, the first is the number of cows and the second the mean.) 

                                January     February    March       April       May         June 

$\bar{y}_h$ (overall)           5  9.400   11  9.090    9  9.111   10  9.059    8  9.211   11  9.345 
$\bar{y}_h'$                        -       4  9.288    8  9.122    8  9.086    3  9.060    7  9.326 
$\bar{y}_h''$                       -       7  8.977    1  9.020    2  8.950    5  9.302    4  9.380 
$\{\bar{y}_h\}$                     -          9.341       9.163       9.188       9.226       9.322 
$\bar{y}_h$ (adjusted)             9.400       9.122       9.151       9.152       9.262       9.338 
$[\bar{y}_h']$                  4  9.335    8  9.072    8  9.025    3  8.947    7  9.267        - 
$\bar{y}_h - [\bar{y}_h']$        + .065      + .050      + .126      + .205      − .005        - 
φ'                                  -           .603        .084        .151        .472        .275 
From differences                   9.400       9.353       9.403       9.464       9.577       9.636 



* To obtain the most accurate possible estimate of change the information from 
occasions prior to h − 1 would have to be taken into account. The procedure has been 
investigated by Patterson (1950, A'). 

Example 6.22 

Similar data on percentage solids-not-fat to those of Example 6.21 are 
given in abstract form in Table 6.22 for the months January-June. Only 
the 3-5 months of lactation are included. Obtain estimates of the mean 
percentage in successive months. 

The table shows for each month the overall mean $\bar{y}_h$, the mean of cows 
occurring in the previous month $\bar{y}_h'$, and the mean of new cows $\bar{y}_h''$. The 
numbers of cows on which these means are based are also shown. The mean 
for the month h − 1 of cows occurring in months h and h − 1 is shown 
in the line $[\bar{y}_h']$ in the column for month h − 1. Thus 9.335 and 9.288 are 
the means for January and February of the four cows occurring in both these 
months. 

Summation of the sums of squares and products of deviations of pairs 
of entries for successive months from January to December gives an overall 
value for r of 0.811, so that $r^2$ is 0.657. The similar calculation of the 
correlation between months two apart gives r' equal to 0.746. The assumption 
that r' equals $r^2$ therefore somewhat underweights the information obtainable 
from occasions two apart. 

The average value of μ over a long period will be 1/3, but considerable 
fluctuations in numbers occur from month to month. For μ = 1/3 the value 
of φ for occasions subsequent to the second is 

$$\varphi = \frac{-(1 - 0.657) + \sqrt{(1 - 0.657)\{1 - 0.657(1 - 4 \cdot \tfrac{2}{3} \cdot \tfrac{1}{3})\}}}{2 \times \tfrac{2}{3} \times 0.657} = 0.252$$

Hence for March 

$$\varphi' = \frac{1}{9 \times \tfrac{1}{3}} \times 0.252 = 0.084$$

etc. These values are shown in Table 6.22. 

For the second occasion formula 6.21.a may be used. This is of the same 
form as formula 6.22.a, and gives 

$$\varphi = \frac{7(1 - 0.657/5)}{4 + 7(1 - 0.657/5)} = 0.603$$

Had the value for μ = 1/3 been calculated, and corrected by means of formula 
6.22.b, we should have obtained 

$$\varphi' = \frac{7}{11 \times \tfrac{1}{3}} \times 0.281 = 0.536$$

which does not differ greatly from the correct value. Equally φ differs little 
from the value 0.252, obtained above for subsequent occasions. 

The remainder of the calculations follow a standard pattern. The quantity 
$\{\bar{y}_h\}$, equal to $\bar{y}_h' + r(\bar{y}_{h-1} - [\bar{y}_{h-1}'])$, is calculated, and the weighted mean 
of $\bar{y}_h''$ and $\{\bar{y}_h\}$ taken, with weights equal to φ' and 1 − φ'. Thus for 
February 

$$\{\bar{y}_h\} = 9.288 + 0.811 \times (+0.065) = 9.341$$

$$\bar{y}_h = 0.603 \times 8.977 + 0.397 \times 9.341 = 9.122$$

The overall estimates $\bar{y}_h$ and the estimates from differences $\bar{y}_h' - [\bar{y}_{h-1}']$ 
are shown for comparison. The differences show a tendency to cumulative 
errors, which is to be expected even with close correlation. 
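
The month-by-month pattern of Example 6.22 can be sketched as follows, taking r and the values of φ' as given in the text and in Table 6.22. Names are illustrative. The sketch carries unrounded values from month to month, whereas the table rounds at each step, so the results may differ by one unit in the third decimal place.

```python
R = 0.811  # correlation between successive months, from the text

# One row per month: (month, mean of cows also present in the previous month,
# mean of new cows, phi', mean in this month of cows also present next month).
TABLE_6_22 = [
    ("January",  None,  None,  None,  9.335),
    ("February", 9.288, 8.977, 0.603, 9.072),
    ("March",    9.122, 9.020, 0.084, 9.025),
    ("April",    9.086, 8.950, 0.151, 8.947),
    ("May",      9.060, 9.302, 0.472, 9.267),
    ("June",     9.326, 9.380, 0.275, None),
]


def phi_prime(n_new, n_total, mu, phi):
    """Formula 6.22.b: adjust phi for unequal numbers of units."""
    return n_new / (mu * n_total) * phi


def adjusted_estimates(first_estimate, rows, r):
    """Apply formula 6.22.a month by month, starting from the first month's mean."""
    estimates = [first_estimate]
    for previous, current in zip(rows, rows[1:]):
        carried = previous[4]  # mean on the earlier occasion of the common cows
        _, matched, new, phi, _ = current
        braced = matched + r * (estimates[-1] - carried)
        estimates.append(phi * new + (1 - phi) * braced)
    return estimates


def estimates_from_differences(first_estimate, rows):
    """Chain the month-to-month differences of the common cows."""
    estimates = [first_estimate]
    for previous, current in zip(rows, rows[1:]):
        estimates.append(estimates[-1] + current[1] - previous[4])
    return estimates


adjusted = adjusted_estimates(9.400, TABLE_6_22, R)
chained = estimates_from_differences(9.400, TABLE_6_22)
for (month, *_), a, c in zip(TABLE_6_22, adjusted, chained):
    print(f"{month:<9} adjusted {a:.3f}   from differences {c:.3f}")
```

The chained differences drift steadily upward from the adjusted estimates, which is the cumulative error referred to above.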

It will be seen that once a value for r has been determined, and provision 
has been made to abstract the means $\bar{y}_h'$, $\bar{y}_h''$, and $[\bar{y}_{h-1}']$, the calculations 
are very simple, and can easily be undertaken for large-scale surveys, even 
when a number of different quantities require estimation. 

CHAPTER 7 
ESTIMATION OF THE SAMPLING ERROR 

7.1 Sampling errors of a random sample 

The general principles involved in the estimation of sampling errors can 
best be made clear by considering the error of a random sample drawn from 
a large population. 

Consider first a sample consisting of a single unit. Let the mean of the 
population be $\bar{y}$, and let the deviations of the individual values from this mean 
be $z_1, z_2, \ldots$, so that $z_1 = y_1 - \bar{y}$, $z_2 = y_2 - \bar{y}$, . . . Then the 
actual error in the estimate of the mean from a sample of one unit will be $z_r$, 
where r is the selected unit. 

The mean of all the z's is zero, and therefore the average of the errors 
of the estimates from a large number of samples of one unit (having regard to 
the signs of the errors) will approximate to zero. This is equivalent to saying 
there is no bias in the estimate. 

In order to obtain a measure of the magnitude of the expected error we 
must therefore obtain some form of average of the z's which does not take 
account of sign. One simple measure which might be taken is the average 
of all the z's without regard to sign, but an alternative measure, which has a 
number of statistical advantages, is provided by the mean of the squares of 
all the z's. This is termed the mean square deviation of y or the variance of y, 
and is denoted by V(y) or σ², and its estimate by V(y) or s². The square 
root of this variance is termed the standard deviation σ of a single unit. 

In the same way, if a sample contains a number of units we may define
the sampling variance of an unbiased estimate, say y, derived from such a
sample as the mean of the squares of the actual errors of a large number of
samples of the same size. This variance will be denoted by \(\mathbf{V}(y)\), or, if
estimated, by \(V(y)\). The square root of this variance is generally termed the
standard error of the estimate, and will be denoted by \(\mathbf{S.E.}(y)\), or, if
estimated, by \(\mathrm{S.E.}(y)\). The term standard error is also sometimes applied
to the standard deviation of a single unit, particularly when the deviations are
in the nature of errors of observation.

The standard error of the estimate of the population mean derived from
a sample of one unit is therefore equal to the standard deviation of a single
unit. If a sample of two units r and s is taken, the actual error of the estimate
of the population mean will be

\[ \bar{y} - \bar{\mathbf{y}} = \tfrac{1}{2}(y_r + y_s) - \bar{\mathbf{y}} = \tfrac{1}{2}(z_r + z_s) \]

The standard error of the estimate will therefore be given by the square root
of the average value of

\[ \tfrac{1}{4}(z_r + z_s)^2 = \tfrac{1}{4}(z_r^2 + z_s^2 + 2 z_r z_s) \]

If the population is large the average value of \(z_r z_s\) is zero, as can be seen if
we consider a series of samples having the same first unit r and different second
units s. The average values of \(z_r^2\) and \(z_s^2\) are both \(\sigma^2\). Hence the average
value of the above expression is \(\tfrac{1}{2}\sigma^2\). Consequently the standard error of the
estimate is \(\sigma/\sqrt{2}\).
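The argument for a sample of two units can be checked by direct enumeration. The following is a minimal sketch in Python, not part of the original text: the five population values are invented for illustration, and the condition of a large population is imitated by letting the two units be drawn independently (so that the same unit may be drawn twice).

```python
from itertools import product
from statistics import fmean

def two_unit_check(population):
    """Average the products z_r * z_s and the squared errors over every
    ordered pair of independently selected units."""
    mean = fmean(population)
    z = [y - mean for y in population]          # deviations from the population mean
    sigma2 = fmean(d * d for d in z)            # mean of the squares of all the z's
    pairs = list(product(z, repeat=2))          # every (z_r, z_s), equally likely
    mean_cross = fmean(zr * zs for zr, zs in pairs)
    mean_sq_error = fmean((0.5 * (zr + zs)) ** 2 for zr, zs in pairs)
    return sigma2, mean_cross, mean_sq_error

# Hypothetical population values, chosen only to illustrate the argument.
sigma2, mean_cross, mean_sq_error = two_unit_check([2, 3, 7, 8, 10])
print(sigma2, mean_cross, mean_sq_error)
```

The average product comes out as zero and the average squared error as one-half of the variance of a single unit, whatever values are supplied, which is the result used in the text.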

It will be noted that the above argument does not depend on the form of
the distribution of the z's: there is no need, for example, for positive and
negative deviations to be equally frequent. It does, however, require that
each unit of the sample shall be randomly and independently selected. If,
for instance, there were a tendency to select a second unit with a deviation
similar to the first unit the average value of \(z_r z_s\) would not be zero.

The argument can easily be extended to a sample of n units, for which the
variance and standard error of the estimate will be found to be

\[ \mathbf{V}(\bar{y}) = \frac{\sigma^2}{n}, \qquad \mathbf{S.E.}(\bar{y}) = \frac{\sigma}{\sqrt{n}} \qquad (7.1.a) \]

We thus have the important general result that the standard error of the 
estimate of the mean of a large population from a random sample is inversely 
proportional to the square root of the number of units in the sample. 

The standard error of the estimate of the total follows immediately from
the rule that if \(l\) is any multiplier the standard error of \(l y\) is equal to \(l\) times
the standard error of \(y\), provided \(l\) is not subject to sampling variation. Thus

\[ \mathbf{S.E.}(Y) = \mathbf{S.E.}(g n \bar{y}) = g n \{\mathbf{S.E.}(\bar{y})\} = g \sigma \sqrt{n} \qquad (7.1.b) \]

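Although S(y) is not itself an estimate it is often convenient to consider its
sampling variance or standard error. From the above rule

\[ \mathbf{S.E.}\{S(y)\} = \mathbf{S.E.}(n \bar{y}) = n \{\mathbf{S.E.}(\bar{y})\} = \sigma \sqrt{n} \]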



The standard error of an estimate can be expressed as a percentage of the
population value of the estimated quantity. This form of expression is useful,
as the percentage standard error is unaffected by the units in which the estimate
is expressed, and the percentage standard error of the mean, of the total of
the sample, and of the estimate of the population total are all equal. Similarly
the standard deviation of a single unit can be expressed as a percentage of the
mean value of a single unit. This is sometimes termed the coefficient of variation.
Denoting it by \(\sigma\%\), we have, in a large population,

\[ \mathbf{S.E.}\%(\bar{y}) = \mathbf{S.E.}\%\{S(y)\} = \mathbf{S.E.}\%(Y) = \sigma\% / \sqrt{n} \]



Thus in a population with a percentage standard deviation per unit of 20 per 
cent., the percentage standard error of the estimate of the population mean 
or total from a sample of 100 will be 2 per cent., that from a sample of 400 will 
be 1 per cent., etc. 
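The rule that the percentage standard error is the percentage standard deviation per unit divided by the square root of the sample number can be put in a form convenient for planning. The sketch below is illustrative only; the function names are not from the text, and a large population is assumed.

```python
import math

def percentage_standard_error(sd_percent, n):
    """Percentage standard error of the mean or total from a random sample
    of n units, given the percentage standard deviation per unit."""
    return sd_percent / math.sqrt(n)

def sample_number_required(sd_percent, target_percent):
    """Smallest sample number giving a percentage standard error not greater
    than target_percent (large population assumed)."""
    return math.ceil((sd_percent / target_percent) ** 2)

print(percentage_standard_error(20, 100))   # sample of 100
print(percentage_standard_error(20, 400))   # sample of 400
print(sample_number_required(20, 1))
```

Halving the percentage standard error therefore requires a fourfold increase in the number of units sampled.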

In order to estimate the standard error of the mean or total in numerical
terms an estimate of the value of \(\sigma\) will be required. This can be obtained
from the deviations \(y - \bar{y}\) of the numerical values of the selected units from
their mean. \(y - \bar{y}\) will be nearly, though not exactly, equal to z, and to a
first approximation an estimate of \(\sigma^2\) will therefore be provided by the mean
square deviation \(S(y - \bar{y})^2/n\). Actually the sum of the squares of the
deviations from the sample mean is always less than the sum of the squares of
the deviations from the population mean, as can be seen from the identity

\[ S(y - \bar{y})^2 = S(y - \bar{\mathbf{y}})^2 - n(\bar{y} - \bar{\mathbf{y}})^2 \]

The average value of the first term on the right-hand side is \(n\sigma^2\), and the
average value of the second term is \(\sigma^2\), since \(\bar{y} - \bar{\mathbf{y}}\) is the error in the estimate
of the mean. Thus \(S(y - \bar{y})^2\) has an average value of \((n - 1)\sigma^2\), and
consequently an estimate of \(\sigma^2\) is given by

\[ s^2 = \frac{S(y - \bar{y})^2}{n - 1} \qquad (7.1.d) \]



The divisor \(n - 1\) is technically known as the number of degrees of freedom
associated with the estimate of error, and is equal to the number of independent 
comparisons that can be made between n values. 

The calculation of the sum of the squares of the deviations \(S(y - \bar{y})^2\)
is best done from the sum of the squares of the values themselves. By this
procedure the calculation and squaring of the individual deviations, which
often involve fractional values, is avoided. One of the expressions

\[ S(y - \bar{y})^2 = S(y^2) - n\bar{y}^2 = S(y^2) - \bar{y}\,S(y) = S(y^2) - \{S(y)\}^2/n \]

is used. The last term of each of the three expressions is usually termed
" the correction for the mean." In calculating it from one of the first two
expressions, \(\bar{y}\) must be taken to at least as many significant figures as are
required in the correction. For this reason the last expression is often the
most convenient.

Sometimes it pays to use some convenient round number \(y_0\) as a working
mean, in which case we have

\[ S(y - \bar{y})^2 = S(y - y_0)^2 - n(\bar{y} - y_0)^2 \]

etc.

If a calculating machine is available the individual squares should not be
written down; the sum of squares can be obtained directly by squaring the
numbers successively without clearing the machine.

The calculation of the sum of squares from grouped data is illustrated
in Examples 7.1.b and 7.2.b.

Example 7.1.a

Estimate the standard error of the estimate of the mean of the population
(assumed large) of which the values of Table 6.4 are a sample.

The computations are as follows:

\[ n = 20 \qquad S(y) = 193.8 \qquad \bar{y} = 9.69000 \]
\[ S(y^2) = 1959.12 \qquad \bar{y}\,S(y) = 1877.92 \qquad S(y - \bar{y})^2 = 81.20 \]
\[ s^2 = \frac{81.20}{19} = 4.274 = 2.07^2 \]
\[ \mathrm{S.E.}(\bar{y}) = \frac{2.07}{\sqrt{20}} = 0.462 \]
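The computations of this example can be reproduced from the three quantities n, S(y) and S(y²) alone, using the last of the three expressions for the sum of squares of deviations. The following is a short sketch; the function name is illustrative and not taken from the text.

```python
import math

def standard_error_of_mean(n, sum_y, sum_y2):
    """Mean, sum of squares of deviations, s^2 and S.E. of the mean of a
    random sample from a large population, from n, S(y) and S(y^2)."""
    mean = sum_y / n
    ss_dev = sum_y2 - sum_y ** 2 / n      # S(y^2) less the correction for the mean
    s2 = ss_dev / (n - 1)                 # n - 1 degrees of freedom
    return mean, ss_dev, s2, math.sqrt(s2 / n)

mean, ss_dev, s2, se = standard_error_of_mean(20, 193.8, 1959.12)
print(round(mean, 2), round(ss_dev, 2), round(s2, 3), round(se, 3))
```

Working from the totals in this way avoids forming and squaring the twenty individual deviations.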



Example 7.1.b 

Table 7.1 gives the distribution of family income in a sample of 162 white 
families in Norfolk-Portsmouth, Virginia. Calculate the mean income of the 
sample and the sampling standard error of this mean. 

TABLE 7.1 ANNUAL NET INCOME OF A SAMPLE OF 162 WHITE FAMILIES
IN NORFOLK-PORTSMOUTH, VIRGINIA, 1934-6

                                       Calculation        Calculation by
                                                       successive summation
 Annual net    No. of    Working    Total     Sum of     Total     Sum of
 income ($)   families    units    (2)×(3)   squares               squares
                                            (2)×(3)²
                                            = (3)×(4)
    (1)         (2)        (3)       (4)       (5)        (6)        (7)

    600-         10        -3        -30        90         10         10
    900-         23        -2        -46        92         33         43
  1,200-         40        -1        -40        40         73        116
  1,500-         32         0                             116
  1,800-         28        +1        +28        28         57        105
  2,100-         20        +2        +40        80         29         48
  2,400-          4        +3        +12        36          9         19
  2,700-          2        +4         +8        32          5         10
  3,000-          1        +5         +5        25          3          5
  3,300-          2        +6        +12        72          2          2

  Total         162                  -11       495        105        358

With grouped data of this type it is best to use the group interval as the
working unit and the central value of one of the central groups as the working
mean. The group $1500-1799 (central value, $1649.5, since the data were
rounded off to the nearest dollar before grouping) has been chosen. The
calculation of the total and sum of squares in these units is shown in columns 4
and 5 of the table. The mean of the sample in working units is therefore
\(-11/162\), i.e. \(-0.06790\), and in the proper units is

\[ 1649.5 - 0.06790 \times 300 = 1629.1 \]

The sum of squares of the deviations in the working units is

\[ 495 - 0.06790 \times 11 = 494.25 \]

and in the proper units is therefore \(494.25 \times 300^2\), i.e. 44,482,000. Hence,
dividing by 161, \(s^2 = 276{,}290\), and the sampling standard error of the mean
income is \(\sqrt{276{,}290/162} = 41.3\).

If no calculating machine, or only an adding machine, is available the
alternative form of calculation shown in columns 6 and 7 may be preferred.
Column 6 is formed from column 2 by successive summation from the ends.
Column 7 is similarly formed from column 6. Note the check
73 + 57 + 32 = 162 for column 6, and the checks of the final values for
column 7 from the totals of column 6. The total in working units is then
given by the difference of the totals of the two halves of column 6, i.e. by
\(105 - 116 = -11\). The sum of squares is obtained by doubling the total
of column 7 and deducting the sum of the totals of the two halves of column 6,
i.e. by \(2 \times 358 - 105 - 116 = 495\).
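The method of successive summation lends itself to a short program. The sketch below, which is offered as an illustration and uses names not found in the text, forms columns 6 and 7 from the group frequencies of Table 7.1 and then converts the results to the proper units, taking the group interval as 300 and the working mean as 1649.5 as stated above.

```python
import math
from itertools import accumulate

def successive_summation(counts, zero_index):
    """Total and sum of squares, in working units, about the working mean.

    counts     -- group frequencies in order of increasing value (column 2)
    zero_index -- position of the group taken as the working mean
    """
    below = counts[:zero_index]               # groups below the working mean, outermost first
    above = counts[zero_index + 1:][::-1]     # groups above it, outermost first
    col6_below = list(accumulate(below))      # column 6, upper half of the table
    col6_above = list(accumulate(above))      # column 6, lower half of the table
    col7 = list(accumulate(col6_below)) + list(accumulate(col6_above))
    total = sum(col6_above) - sum(col6_below)
    sum_sq = 2 * sum(col7) - sum(col6_above) - sum(col6_below)
    return total, sum_sq

counts = [10, 23, 40, 32, 28, 20, 4, 2, 1, 2]     # column 2 of Table 7.1
n = sum(counts)
total, sum_sq = successive_summation(counts, zero_index=3)

interval, working_mean = 300.0, 1649.5
mean = working_mean + interval * total / n
ss_dev = (sum_sq - total ** 2 / n) * interval ** 2
se_mean = math.sqrt(ss_dev / (n - 1) / n)
print(total, sum_sq, round(mean, 1), round(se_mean, 1))   # -11 495 1629.1 41.3
```

Only additions are needed to form the two columns, which is why the method suits an adding machine; the results agree with those of the direct calculation in columns 4 and 5.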

7.2 Sampling from a finite population 

The above theory requires modification in two respects if the population
is not large. In the first place it is best to define \(\sigma^2\) as

\[ \sigma^2 = \frac{S_p(y - \bar{\mathbf{y}})^2}{N - 1} \]

where \(S_p\) denotes summation over the whole population. This is equivalent
to regarding the population as itself a random sample from an infinitely large
population with variance \(\sigma^2\). With this definition of \(\sigma^2\) formula 7.1.d for
\(s^2\) stands without modification. In certain textbooks and scientific papers
the alternative definition with divisor N is adopted. This introduces the
factor \((N - 1)/N\) into the formula for \(s^2\), and leads to other complications
in the discussion of the errors of sampling from finite populations which are
avoided by the use of the first definition.

In the second place, the formula for the standard error of the estimate of
the mean or other estimate requires modification by the introduction of the
factor \(\sqrt{1 - f}\), or more strictly \(\sqrt{1 - \mathbf{f}}\). Thus we have

\[ \mathbf{S.E.}(\bar{y}) = \sigma \sqrt{\frac{1 - f}{n}} \qquad (7.2) \]

That the introduction of some factor of this kind is necessary is obvious, since
if the whole population is included in the sample (\(f = 1\)) the sampling error
will be zero. The actual factor can be deduced by an extension of the algebraic
analysis given above.

It should be noted that the factor \(\sqrt{1 - f}\) should not be introduced
when testing the difference between the means of two sampled populations 
to see whether, for example, they are subject to different causal agencies. 
In this case we are concerned to determine whether there is a real and consistent 
difference running through all the units of the two populations : in other 
words, we wish to test whether the two samples can reasonably be regarded 
as random samples from a single infinitely large parent population, or whether 
they have to be regarded as samples of two different parent populations. 

Example 7.2.a

Estimate the standard errors applicable to the estimates obtained in
Example 6.4.b.

From Example 7.1.a, \(s^2 = 4.274\), and consequently

\[ \mathrm{S.E.}(\bar{y}) = \sqrt{\frac{4.274}{20}\left(1 - \frac{1}{25}\right)} = 0.453 \]

and since

\[ Y = 500\,\bar{y} \]
\[ \mathrm{S.E.}(Y) = 500 \times 0.453 = 226 \]
\[ \mathrm{S.E.}(Y') = 507 \times 0.453 = 230 \]

Example 7.2.b

Estimate the sampling error of the wheat acreage from the random sample
of 125 farms of Table 6.6.a.

We find:

\[ n = 125 \qquad S(y) = 2301 \qquad \bar{y} = 18.4080 \]
\[ S(y^2) = 207{,}261 \qquad S(y - \bar{y})^2 = 164{,}904 \qquad s^2 = 164{,}904/124 = 1329.9 \]
\[ \mathrm{S.E.}(\bar{y}) = \sqrt{1329.9\,(1 - \tfrac{1}{20})/125} = 3.18 \]
\[ \mathrm{S.E.}(Y) = 20\sqrt{125 \times 1329.9\,(1 - \tfrac{1}{20})} = 20 \times 125 \times 3.18 = 7950 \]

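The effect of the factor for sampling from a finite population may be illustrated by recomputing Examples 7.2.a and 7.2.b. The sketch below is not from the text; the function names are illustrative, and the sampling fractions of 1/25 and 1/20 are taken as read from the two examples.

```python
import math

def se_mean_finite(s2, n, f):
    """Estimated standard error of the mean of a random sample of n units
    from a finite population, f being the sampling fraction."""
    return math.sqrt(s2 * (1.0 - f) / n)

def se_total_finite(s2, n, f):
    """Estimated standard error of the population total Y = g * S(y),
    where g = 1/f is the raising factor."""
    g = 1.0 / f
    return g * n * se_mean_finite(s2, n, f)

se_a = se_mean_finite(4.274, 20, 1 / 25)        # Example 7.2.a
se_b = se_mean_finite(1329.9, 125, 1 / 20)      # Example 7.2.b
total_b = se_total_finite(1329.9, 125, 1 / 20)
print(round(se_a, 3), round(se_b, 2), round(total_b))
```

The total obtained here differs slightly from the 7950 of the example because the example multiplies by the rounded value 3.18.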
The calculation of the mean and of the sum of squares of deviations may
alternatively be carried out by grouping the data. The groups should be so
chosen that the distribution within any group containing a substantial number
of values is reasonably even. A grouping interval of 10 acres is here convenient,
but because of the large number of zeros these must be included in a separate
group.

The grouped data and the calculation of the total and sum of squares are
shown in Table 7.2. The calculations are carried out in terms of working
values in units of the grouping interval, and a working mean of 24.5 (the mean
of group 4) is taken. The total is obtained from column 4 and the sum of
squares from column 5. The mean in terms of the grouping interval is therefore

\[ (-209.75 + 136)/125 = -73.75/125 = -0.590 \]

and in terms of the proper units is

\[ \bar{y} = 24.5 - 0.590 \times 10 = 18.60 \]

Similarly

\[ s^2 = (1719.2 - 73.75 \times 0.590) \times 10^2/124 = 167{,}570/124 = 1351.4 \]

The rest of the computations proceed as before.

TABLE 7.2 CALCULATION OF THE MEAN AND VARIANCE FROM GROUPED DATA
(WHEAT ACREAGES OF TABLE 6.6.a)

  Acres     Number    Working      Total       Squares
                       value      (2)×(3)     (2)×(3)²

    0         80       -2.45     -196.00        480.2
    1-         5       -1.95       -9.75         19.0
   10-         4       -1          -4             4
   20-        11        0        -209.75
   30-         2       +1          +2             2
   40-         2       +2          +4             8
   50-         4       +3         +12            36
   60-         4       +4         +16            64
   70-         4       +5         +20           100
   80-         3       +6         +18           108
   90-         1       +7          +7            49
  100-         3       +8         +24           192
  110-         1       +9          +9            81
  260-         1      +24         +24           576

  Total      125                 +136          1719.2



If the data are fully tabulated, grouping is scarcely worth while for so
small a body of data, even when a calculating machine is not available,
especially when, as here, the existence of zeros complicates the grouping while
simplifying the direct calculation of the sum of squares. With material on
punched cards, however, the data can be most easily and compactly presented
in grouped form, and the advantages of this form of computation therefore
become much greater, particularly when the number of values is large.
Grouping also enables the form of the distribution to be much more easily
comprehended. In the present data the relatively large number of values
between 20 and 30, and the single very high value of 265, are immediately
apparent.

7.3 The normal law of error

The above analysis shows that it is possible, from the numerical values 
of the selected sampling units, to estimate the standard error of the estimate 
of the mean of the population. This gives us a measure of the average error 
to be expected. The analysis has not, however, given us any indication of 
the frequency with which errors of different magnitudes may be expected to
occur.

It is a matter of common observation that in most material which is subject 
to quantitative variation large deviations tend to occur less frequently than 
do small deviations. In much material, also, positive and negative deviations 
occur with about equal frequency. The exact distribution of the deviations 
of individual sampling units will, of course, vary considerably in different
types of material, but it is a fortunate circumstance that, over a wide range 
of distributions of the parent material, the errors to which estimates such as 
the mean, total, etc., are subject are distributed approximately according to 
what is known as the normal law of error, i.e. in a normal distribution. Other 
things being equal, the larger the sample on which the estimate is based, the 
more closely is the law followed. If the deviations of the original material 
are normally distributed, the errors of the estimate of the mean, etc., will 
conform exactly to a normal distribution. 

In a normal distribution the frequency with which deviations within the
infinitesimal range \(z\) to \(z + dz\) may be expected to occur is given by the
expression:

\[ \frac{1}{\sigma\sqrt{2\pi}}\; e^{-z^2/2\sigma^2}\, dz \]

where \(\sigma\) is the standard deviation, and e is the base of Napierian logarithms,
2.71828 approximately.

Fig. 7.3 shows normal distributions with standard deviations \(\sigma = 1\) and
\(\sigma = 2\). The vertical scale represents the frequency with which deviations
within a range of 0.1 of z occur per 1000 values. Thus the ordinate at \(z = 0\)
for \(\sigma = 1\) is 39.9, which indicates that on the average 39.9 values per 1000
may be expected to have deviates having values between \(-0.05\) and \(+0.05\).
The value 39.9 can be derived from the above formula by putting \(\sigma = 1\),
\(z = 0\), \(dz = 0.1\) and multiplying by 1000.
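The derivation of the ordinate 39.9 can be verified numerically. The following sketch evaluates the usual expression for the normal frequency function; the function name and the default range of 0.1 are illustrative choices, not taken from the text.

```python
import math

def frequency_per_1000(z, sigma, dz=0.1):
    """Approximate number of values per 1000 falling within a range dz
    centred on the deviation z, in a normal distribution with standard
    deviation sigma."""
    density = math.exp(-z * z / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))
    return 1000.0 * density * dz

print(round(frequency_per_1000(0.0, 1.0), 1))   # ordinate at z = 0 for sigma = 1
print(round(frequency_per_1000(0.0, 2.0), 1))   # the flatter curve, sigma = 2
```

Doubling the standard deviation halves the central ordinate, which is why the curve for the larger standard deviation is the flatter of the two.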

From the figure it will be seen that positive and negative deviations of a
given magnitude occur with equal frequency, and that large deviations are
much less frequent than small ones. We are, however, in general not so much
concerned with the frequency of a deviation of any particular magnitude, as
with the frequency with which deviations greater than a given magnitude
may be expected to occur. These latter frequencies, which correspond to
areas in Fig. 7.3, are shown in Table A.2 at the end of the book, for various
values of \(z/\sigma\). The area for \(z/\sigma = 1\) is shaded for both curves.

From Table A.2 it will be seen that 61.7 per cent. of all values have a
deviation or error (positive or negative) greater than one-half the standard
deviation or standard error, 31.7 per cent. of all values have a deviation greater
than the standard deviation, but only 4.6 per cent. have a deviation greater than
twice the standard deviation, and only 0.27 per cent. will have a deviation
greater than three times the standard deviation. Consequently, if we know
that an estimate is subject to the normal law of error and has a given standard
error, we can assign probable limits of error to this estimate. If limits of plus
or minus twice the standard error are taken then in only 4.6 per cent. of the
cases will the actual error lie outside these limits.

FIG. 7.3 NORMAL FREQUENCY DISTRIBUTIONS WITH STANDARD DEVIATIONS σ = 1
AND σ = 2. THE FLATTER CURVE IS THAT FOR σ = 2.

The shaded areas represent the total frequencies of the values for which the actual
deviations are greater than the standard deviation.

An alternative form of this statement, utilizing fiducial probability, which 
has certain logical advantages, is as follows. 

If the true value of the mean is equal to the estimate minus twice the
standard error (the lower limit of error) then a value of the estimate as
high as, or higher than, that actually observed will be obtained in only 2.3 per
cent. of all samples. (The values of Table A.2 are divided by 2 since deviations
in only one direction are involved.) Similarly, if the true value of the mean
is equal to the estimate plus twice the standard error (the upper limit of error),
a value of the estimate as low as, or lower than, that actually observed will be
obtained in only 2.3 per cent. of all samples. For limits of plus or minus
once the standard error the corresponding percentages are 16 per cent. These
closer limits are useful as indicating the region within which, or in the fairly
close neighbourhood of which, the mean is likely to lie.
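The percentages quoted from Table A.2, and the halved values used in the fiducial form of statement, can be recomputed from the complementary error function. The sketch below uses the standard library function `math.erfc`; the two function names are illustrative only.

```python
import math

def two_sided_tail(k):
    """Proportion of a normal distribution with deviations (positive or
    negative) greater than k times the standard deviation."""
    return math.erfc(k / math.sqrt(2.0))

def one_sided_tail(k):
    """Proportion with deviations greater than k standard deviations in
    one specified direction only."""
    return 0.5 * two_sided_tail(k)

for k in (0.5, 1.0, 2.0, 3.0):
    print(k, round(100 * two_sided_tail(k), 2), round(100 * one_sided_tail(k), 2))
```

The one-sided values for once and twice the standard error are the 16 per cent. and 2.3 per cent. of the preceding paragraph.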

As pointed out above, we usually only have an estimate of the standard 
error, which is itself subject to error, the accuracy being dependent on the 
number of degrees of freedom, and statements in the above form are therefore 
not exact. 

The effect of inaccuracy in the estimate of error on fiducial statements can 
be allowed for by the use of what is known as the t distribution, instead of the 
normal distribution. In general, however, inaccuracies due to paucity of data 
are not sufficiently great for this to be necessary in census work. More important 
is the fact that with a number of types of sample frequently employed, e.g. 
systematic samples and stratified samples with one unit from each stratum, 
fully valid estimates of error are not available. The estimates of error actually 
obtained in such cases are usually overestimates of the sampling standard
errors, and any exact fiducial statement is therefore impossible. 

In a random sample of n (\(n - 1\) degrees of freedom) from a normal
distribution with standard deviation \(\sigma\) the standard error of s is given by

\[ \mathbf{S.E.}(s) = \frac{\sigma}{\sqrt{2(n - 1)}} \qquad (7.3) \]



If the material conforms approximately to the normal law of variation, an
estimate based on 50 degrees of freedom will therefore determine the sampling
error with a standard error of 10 per cent. and an estimate based on 200 degrees
of freedom with a standard error of 5 per cent. If the material does not conform
to the normal law the accuracy may be substantially less.
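The two figures just quoted follow from expressing the standard error of s as a fraction of the standard deviation, namely one over the square root of twice the number of degrees of freedom. A brief sketch, assuming normally distributed material as the text requires; the function name is illustrative.

```python
import math

def percentage_se_of_s(degrees_of_freedom):
    """Standard error of the estimated standard deviation s, expressed as a
    percentage of the true standard deviation (normal material assumed)."""
    return 100.0 / math.sqrt(2.0 * degrees_of_freedom)

for df in (19, 50, 200):
    print(df, round(percentage_se_of_s(df), 1))
```

With the 19 degrees of freedom of Example 7.1.a the estimate of the standard deviation is itself subject to a standard error of about 16 per cent.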

Example 7.3

Assign limits of error to the estimate of the mean of the population (assumed
large) of the values of Table 6.4. (The values are actually a random sample
from a normal distribution with mean 10 and standard deviation 2.) Show
that they are distributed in the expected manner. Calculate also the standard
error of the estimate of the standard deviation.

The results obtained in Example 7.1.a indicate that the true value of the
mean is not likely to lie far outside the range \(9.69 \pm 0.46\), i.e. 9.23-10.15,
and is fairly certain to lie within the range \(9.69 \pm 2 \times 0.46\), i.e. 8.77-10.61.

The distribution of the 20 values is given in Table 7.3. Integral values
have been allotted ½ and ½ to the two appropriate classes. Each interval in
this grouping is equal to \(0.5\sigma\). From Table A.2 the proportionate frequency
of observations with deviations greater than \(0.5\sigma\) (positive or negative) is
0.6171, and consequently the expected frequencies of observations between
9 and 10 and between 10 and 11 are each \(20 \times \tfrac{1}{2}(1 - 0.6171) = 3.83\).

TABLE 7.3 OBSERVED AND EXPECTED FREQUENCIES IN THE SAMPLE
OF TABLE 6.4 FROM A NORMAL DISTRIBUTION

Range      <6    6-7   7-8   8-9   9-10  10-11  11-12  12-13  13-14  14-15   >15   Total
Observed    -    1     2.5   5     2.5   5.5    1      0.5    1      1        -    20
Expected   0.45  0.88  1.84  3.00  3.83  3.83   3.00   1.84   0.88   0.33    0.12  20.00

Similarly the proportionate frequency of observations greater than \(1.0\sigma\)
is 0.3173, and consequently the expected frequencies between 8 and 9 and
between 11 and 12 are each \(20 \times \tfrac{1}{2}(0.6171 - 0.3173) = 3.00\). In this manner
all the expected frequencies shown in Table 7.3 can be calculated. The
observed frequencies conform satisfactorily to the expected frequencies.
From formula 7.3 the standard error of the estimate s is

\[ \mathbf{S.E.}(s) = \frac{2}{\sqrt{2 \times 19}} = 0.32 \]

The actual value of s, 2.07, is therefore closer to the true value than will occur
on the average in samples of 20.
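The expected frequencies of Table 7.3 can all be generated in the manner described. The sketch below is illustrative and uses names not found in the text; it works outwards from the mean in steps of one-half the standard deviation, obtaining the tabular values of Table A.2 from `math.erfc`, and also evaluates formula 7.3 with the true standard deviation of 2.

```python
import math

def two_sided_tail(k):
    """Proportion of a normal distribution deviating by more than k sigma."""
    return math.erfc(k / math.sqrt(2.0))

def expected_one_side(n, step, classes):
    """Expected frequencies, on one side of the mean, of successive classes
    of width step * sigma, followed by the frequency beyond the last class."""
    freqs = []
    for j in range(classes):
        inner, outer = two_sided_tail(j * step), two_sided_tail((j + 1) * step)
        freqs.append(n * 0.5 * (inner - outer))
    beyond = n * 0.5 * two_sided_tail(classes * step)
    return freqs, beyond

# Classes 10-11, 11-12, 12-13, 13-14, 14-15 and the remainder above 15.
freqs, beyond = expected_one_side(20, 0.5, 5)
print([round(x, 2) for x in freqs], round(beyond, 2))

se_s = 2.0 / math.sqrt(2 * 19)    # formula 7.3 with sigma = 2 and 19 degrees of freedom
print(round(se_s, 2))
```

By symmetry the same values serve for the classes below the mean, except that in Table 7.3 all values below 6 are thrown into a single class.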

7.4 Qualitative variates 

From the procedure developed for random samples it will be seen that 
the estimation of the sampling errors of estimates derived from a quantitative 
variate can be divided into two distinct stages, the first being the estimation 
of the variability of the individual sampling units (or more strictly the part 
of the variability which contributes to sampling error), and the second the 
derivation of the standard errors of the estimates in terms of the variability 
of the individual sampling units. 

The same principles hold when the variate under consideration is 
qualitative. In the case of a random sample, however, the variability of an 
attribute of the sampling units depends only on the proportion of units 
possessing the attribute in the population. Hence in random samples no 
estimate of the variability of the individual sampling units is required. 

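For a random sample from a large population, if \(\mathbf{q} = 1 - \mathbf{p}\), we have

\[ \mathbf{V}(p) = \frac{\mathbf{p}\mathbf{q}}{n}, \qquad \mathbf{S.E.}(p) = \sqrt{\frac{\mathbf{p}\mathbf{q}}{n}} \]

If the population is finite and the sampling fraction \(f\) is appreciable the formula
becomes, to all necessary accuracy,

\[ \mathbf{S.E.}(p) = \sqrt{\mathbf{p}\mathbf{q}(1 - f)/n} \]

The exact expression is obtained by replacing \((1 - f)\) by \((N - n)/(N - 1)\).

Similarly

\[ \mathbf{V}(a) = n\,\mathbf{p}\mathbf{q}(1 - f) \]

and hence

\[ \mathbf{S.E.}(U) = g\sqrt{n\,\mathbf{p}\mathbf{q}(1 - f)} = N\{\mathbf{S.E.}(p)\} \]
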
Fig. 7.4 shows the way in which the standard error of the estimated
percentage S.E. (100 p) varies with the percentage 100 p in samples of 100
and 1000 from a large population. The actual values of S.E. (100 p) are shown
by the full line, while the dotted line gives the percentage standard error
100 S.E. (100 p)/100 p.




FIG. 7.4 STANDARD ERRORS OF THE ESTIMATED PERCENTAGE OF UNITS HAVING A GIVEN
ATTRIBUTE, AND OF THE ESTIMATED NUMBER HAVING THE ATTRIBUTE, FOR DIFFERENT
PERCENTAGES OF UNITS HAVING THE GIVEN ATTRIBUTE IN THE POPULATION
(Horizontal axis: percentage in population, 100 p.)

The full line shows the actual standard error of the estimated percentage, and the
broken line the percentage standard error of the estimated number. This is equal
to the percentage standard error of the estimated percentage. The scales shown are
for samples of 100 and 1000. For a sample of 10,000 divide the values of the left-hand
scale by 10, etc.

The standard errors obtained with larger samples for which the sample
number is a power of 10 can also be read from the figure by dividing one of
the scales by the appropriate power of 10. Thus for a sample of 10,000 the
scale for the sample of 100 is divided by 10, since \(\sqrt{10{,}000/100} = 10\).

The actual standard error has its maximum value at p = 0.5. At this
point the standard error of the estimated percentage with a sample of 100
is 5.0, and with a sample of 1000 is 1.58, i.e. if the true percentage is 50 per
cent. the value of the estimated percentage will usually lie between 40 per cent.
and 60 per cent. with a sample of 100, and between 47 per cent. and 53 per
cent. with a sample of 1000. Expressed in percentage terms the standard
errors and limits at this point are double the above values. As the percentage
in the population decreases from 50 per cent. the actual standard error of the
estimated percentage also decreases, but the percentage standard error continues
to increase. With 100 p = 20 per cent. the actual standard error with a sample
of 100 is 4.0 and the percentage standard error 20 per cent.; with 100 p = 5 per
cent. they are 2.2 and 44 per cent. respectively. Thus, while quite a small sample
serves to verify that the proportion in a population possessing a given attribute
is small, the determination with any accuracy of the actual number possessing
the attribute requires a relatively large sample when the proportion is small.
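The values read from Fig. 7.4 can be recomputed directly from the formula for the standard error of a proportion. The following sketch is illustrative only; the function names are not from the text, and the sampling fraction f defaults to zero, corresponding to a large population.

```python
import math

def se_of_percentage(percentage, n, f=0.0):
    """Actual standard error of the estimated percentage, for a random
    sample of n units when the population percentage is as given."""
    p = percentage / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) * (1.0 - f) / n)

def percentage_se(percentage, n, f=0.0):
    """Standard error expressed as a percentage of the population percentage."""
    return 100.0 * se_of_percentage(percentage, n, f) / percentage

for pct, n in ((50, 100), (50, 1000), (20, 100), (5, 100)):
    print(pct, n, round(se_of_percentage(pct, n), 2), round(percentage_se(pct, n), 1))
```

The output shows the two opposing tendencies described above: as the percentage falls from 50 to 5 the actual standard error falls, while the percentage standard error rises.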

In estimating the sampling error the proportion in the population \(\mathbf{p}\) can
be replaced by its estimate \(p\) from the sample, i.e. by the proportion in the
sample. This results in a certain amount of error in the estimate of variability,
since the proportion in the sample will not in general be exactly equal to that
in the population, but in large samples, such as are commonly met with in
census work, this is not likely to be of much importance. Exact treatment
of the problem is possible by use of Table VIII.1 of Statistical Tables for
Biological, Agricultural and Medical Research.

It must be clearly recognized that the above formulae hold only when the 
units of which the proportion possessing a given attribute is being assessed 
are themselves the sampling units, and the sample is a random one from the 
whole population. In a stratified random sample the formulae apply to each 
stratum taken separately. In other cases, e.g. multi-stage sampling, and all 
types of sampling with supplementary information, the variability no longer 
depends only on the proportions in the population or strata. Thus, for 
example, the formulae are not applicable to the proportion of farms growing
a given crop when two-stage sampling by administrative districts, and by 
farms within selected districts, has been carried out. Equally they are not 
applicable to the proportion of individuals of a given race in a human population, 
when the sampling has been by households ; since the whole of a household 
is usually of the same race, the variability will clearly be greater than if a 
random sample of individuals had been taken. 

When the above formulae do not hold, the variability of the individual
sampling units must be assessed in the same manner as with a quantitative 
variate, scoring the qualitative variate 1 or 0. (See Example 7.8.b.) 

Example 7.4

Estimate the sampling errors of the estimates of Example 6.4.a.
We have \(p = 0.0738\), \(q = 1 - 0.0738 = 0.9262\). Hence

\[ \mathrm{S.E.}(p) = \sqrt{\frac{0.0738 \times 0.9262}{8491}\left(1 - \frac{1}{50}\right)} = 0.00281 \]

S.E. (percentage defective) = 100 × 0.00281 = 0.281
S.E. (total number defective) = S.E. (U) = 50 × 8491 × 0.00281 = 1190

Thus the percentage defective, 7.38 per cent., has a standard error of
0.28, which implies, taking limits of plus or minus twice the standard error,
that the true percentage defective probably lies between 6.8 per cent. and
7.9 per cent. Similarly the number defective probably lies between 29,000
and 33,700. Note that the standard error expressed as a percentage of the
percentage defective or number defective, i.e. what is ordinarily called the
percentage standard error, is

\[ \mathrm{S.E.}\%(p) = \mathrm{S.E.}\%(U) = \frac{0.281}{7.38} \times 100 = \frac{1190}{U} \times 100 = 3.8 \text{ per cent.} \]



These standard errors are likely to be slight overestimates, since the sample 
was in fact systematic. 
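The figures of this example may be recomputed as follows. The sketch is illustrative and not from the text; it takes the sample number as 8491 and the raising factor as 50 from the expression for S.E. (U) above, and assumes the sampling fraction to be the reciprocal of the raising factor.

```python
import math

def se_of_proportion(p, n, f=0.0):
    """Estimated standard error of a proportion p from a random sample of n
    units, with sampling fraction f."""
    return math.sqrt(p * (1.0 - p) * (1.0 - f) / n)

n, g, p = 8491, 50, 0.0738           # sample number, raising factor, proportion defective
se_p = se_of_proportion(p, n, f=1.0 / g)
se_total = g * n * se_p               # S.E.(U) = N * S.E.(p), with N = g * n
percentage_se = 100.0 * se_p / p

lower = 100.0 * (p - 2.0 * se_p)      # limits of plus or minus twice the standard error
upper = 100.0 * (p + 2.0 * se_p)
print(round(se_p, 5), round(se_total), round(percentage_se, 1))
print(round(lower, 1), round(upper, 1))
```

The standard error of the total differs slightly from the 1190 of the example, which is obtained from the rounded value 0.00281 and given to three significant figures.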

7.5 Standard errors of functions of estimates 

If we have a number of estimates \(y_1, y_2, y_3, \ldots\) with sampling errors
which are independent, the sampling variances being \(V(y_1), V(y_2), V(y_3),
\ldots\) and we form a linear function of the y's:

\[ L = l_1 y_1 + l_2 y_2 + l_3 y_3 + \ldots \]

where the l's are any multipliers whose values are not influenced by the
sampling, the sampling variance of L is given by

\[ V(L) = l_1^2 V(y_1) + l_2^2 V(y_2) + l_3^2 V(y_3) + \ldots \qquad (7.5.a) \]

The condition of independence is important. The sampling errors of two 
estimates will be independent if the estimates are derived from sets of values 
which are themselves independent. Estimates derived from samples of 
different populations, or from different strata of the same population, are 
consequently independent, as are estimates derived from different samples of 
a large population. Estimates derived from two different variates belonging 
to the same sampling units are not in general independent, since such variates 
are likely to be correlated, high values of the one being associated with high 
(or low) values of the other in the same sampling units. 

A number of important simple formulae are derivable from the above 
general formula. 

The standard error of a multiple of an estimate is the same multiple of the
standard error of the estimate:

\[ V(l y_1) = l^2 V(y_1) \qquad (7.5.b) \]
\[ \mathrm{S.E.}(l y_1) = l\,\mathrm{S.E.}(y_1) \]

This formula has already been used in Section 7.1.

The standard error of the difference of two independent estimates is the
square root of the sum of the squares of the standard errors of the estimates:

\[ V(y_1 - y_2) = V(y_1) + V(y_2) \qquad (7.5.c) \]
\[ \mathrm{S.E.}(y_1 - y_2) = \sqrt{\{\mathrm{S.E.}(y_1)\}^2 + \{\mathrm{S.E.}(y_2)\}^2} \]

The standard error of the sum of a number of independent estimates is the
square root of the sum of the squares of the standard errors of the estimates:

\[ V(y_1 + y_2 + y_3 + \ldots) = V(y_1) + V(y_2) + V(y_3) + \ldots \qquad (7.5.d) \]
\[ \mathrm{S.E.}(y_1 + y_2 + y_3 + \ldots) = \sqrt{\{\mathrm{S.E.}(y_1)\}^2 + \{\mathrm{S.E.}(y_2)\}^2 + \{\mathrm{S.E.}(y_3)\}^2 + \ldots} \]

which may be expressed by the rule that " variances are additive."
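The general formula and its special cases can be expressed in a few lines. The sketch below is an illustration with invented standard errors of 3 and 4; the function names are not from the text, and independence of the estimates is assumed throughout.

```python
import math

def variance_of_linear_function(multipliers, variances):
    """Sampling variance of L = l1*y1 + l2*y2 + ... for independent estimates,
    the multipliers not being subject to sampling variation."""
    return sum(l * l * v for l, v in zip(multipliers, variances))

def se_of_difference(se_1, se_2):
    """Standard error of the difference of two independent estimates."""
    return math.sqrt(variance_of_linear_function([1, -1], [se_1 ** 2, se_2 ** 2]))

def se_of_sum(standard_errors):
    """Standard error of the sum of any number of independent estimates."""
    ones = [1] * len(standard_errors)
    return math.sqrt(variance_of_linear_function(ones, [s * s for s in standard_errors]))

# Hypothetical standard errors, chosen only to illustrate the rule.
print(se_of_difference(3.0, 4.0), se_of_sum([3.0, 4.0]))
```

Since the multipliers enter only as squares, the difference and the sum of two independent estimates have the same standard error.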

The standard error of the estimate of the mean of a large population can 
also be derived from the formula. 

Weighted means are a type of linear function which occurs frequently in
statistics. The general form of a weighted mean is

\[ \bar{y}_w = \frac{w_1 y_1 + w_2 y_2 + \ldots}{w_1 + w_2 + \ldots} \]

where the w's are the weights. Knowing the variances of the y's, the variance
of \(\bar{y}_w\) can be calculated, provided the y's are independent. Two cases are of
frequent occurrence.

(1) \(V(y_1) = V(y_2) = \ldots = V(y)\)
We then have

\[ V(\bar{y}_w) = V(y)\,\frac{S(w^2)}{\{S(w)\}^2} \qquad (7.5.e) \]

(2) \(V(y_1) = A/w_1\), etc., where A is a constant.
We then have

\[ V(\bar{y}_w) = \frac{A}{S(w)} \qquad (7.5.f) \]

This is the form of weighted mean which is used when we wish to obtain 
the most accurate combined estimate from a number of independent estimates 
of the same quantity whose relative variances are known. The weights are 
taken equal (or proportional) to the reciprocals of the variances, and the variance 
of the weighted mean is given by the reciprocal of the sum of the weights 
(or a multiple of this reciprocal). 
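The combination of independent estimates with weights equal to the reciprocals of their variances may be sketched as follows. The two estimates and their variances are invented for illustration, and the function names are not from the text; the second function applies the general rule for a linear function with multipliers w divided by the sum of the weights, as a check on the first.

```python
def combine_by_reciprocal_variances(estimates, variances):
    """Weighted mean of independent estimates of the same quantity, with
    weights equal to the reciprocals of the variances, and its variance."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    return mean, 1.0 / total_weight          # variance = reciprocal of sum of weights

def variance_of_weighted_mean(weights, variances):
    """Variance of S(wy)/S(w) for independent y's with the given variances."""
    total_weight = sum(weights)
    return sum((w / total_weight) ** 2 * v for w, v in zip(weights, variances))

# Hypothetical estimates 10 and 14 with variances 1 and 3.
mean, variance = combine_by_reciprocal_variances([10.0, 14.0], [1.0, 3.0])
print(round(mean, 6), round(variance, 6))
print(round(variance_of_weighted_mean([1.0, 1.0], [1.0, 3.0]), 6))   # equal weights, for comparison
```

With these figures the weighted mean has variance 0.75, against 1.0 for the unweighted mean of the same two estimates.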

A further type of weighted mean is that in which the weights w are in the
nature of supplementary information, the quantities y and w both being
determined from the individual sampling units, with the variances of the y's
related in some unknown manner to the w's, and $\bar{y}_w = S(wy)/S(w)$.

In order to obtain an unbiased estimate of $V(\bar{y}_w)$, whatever the variance
law, the squares of the deviations of the y from $\bar{y}_w$ must be weighted in proportion
to $w^2$ before summation. For a random sample, if

$$Q = S \, w^2 (y - \bar{y}_w)^2$$

and

$$V = Q/(n - 1)$$

we have



It may also be noted that if the variance of y for given w can be regarded
as constant over the range of w, and there is also no variation in the mean
value of y for given w over the range other than that ascribable to random
variation in y, the efficient estimate of the variance of y is given by the ordinary
formula

$$V(y) = S(y - \bar{y})^2/(n - 1) \qquad (7.5.h)$$

and formula 7.5.e may be used to estimate $V(\bar{y}_w)$, with the introduction of
the factor $(1 - f)$. If the variance of y for given w is inversely proportional
to w then the efficient estimate of the variance of y is given by



$V_2$ is an estimate of A, and formula 7.5.f can be used for estimating $V(\bar{y}_w)$,
with the introduction of the factor $(1 - f)$. Either of these estimates will
be biased if the true variance law is different from that assumed or the other
condition does not hold. They should therefore not be used without careful
consideration.

The mean ratio $\bar{r}$ used in the ratio method of estimation is an example
of a weighted mean of the above type, since $\bar{r} = S(y)/S(x) = S(xr)/S(x)$,
where r denotes the ratio y/x for an individual unit,
and we therefore substitute r for y and x for w. This case is discussed in more
detail in Sections 7.8-7.11, which deal with the estimation of errors in the
ratio method in both random and stratified samples. Normally formula 7.5.g
will be used to estimate $V(\bar{r})$, but under certain circumstances formulae 7.5.h
and 7.5.e may be employed.

The approximate formulae for the standard errors of the product and the
ratio of two estimates whose sampling errors are independent may also be
noted. These are given by

$$V(y_1 y_2) = y_2^2 \, V(y_1) + y_1^2 \, V(y_2) \qquad (7.5.j)$$

$$V(y_1/y_2) = \frac{y_1^2}{y_2^2} \left\{ \frac{V(y_1)}{y_1^2} + \frac{V(y_2)}{y_2^2} \right\} \qquad (7.5.k)$$

These formulae are only satisfactory if $V(y_1)$ and $V(y_2)$ are small relative to
$y_1^2$ and $y_2^2$ respectively.
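
The approximations for a product and a ratio of two independent estimates can be written out as follows. This is a minimal sketch: the function names and the numerical values are hypothetical, and the results are trustworthy only when each variance is small relative to the square of the corresponding estimate.

```python
def var_product(y1, v1, y2, v2):
    """Approximate variance of y1*y2 for independent estimates."""
    return y2 ** 2 * v1 + y1 ** 2 * v2

def var_ratio(y1, v1, y2, v2):
    """Approximate variance of y1/y2 for independent estimates."""
    return (y1 / y2) ** 2 * (v1 / y1 ** 2 + v2 / y2 ** 2)

# Hypothetical estimates with small relative variances.
y1, v1 = 100.0, 4.0   # S.E. 2, i.e. 2 per cent of y1
y2, v2 = 50.0, 1.0    # S.E. 1, i.e. 2 per cent of y2

print(var_product(y1, v1, y2, v2))
print(var_ratio(y1, v1, y2, v2))
```

Both results amount to adding the squares of the proportional standard errors, here 2 per cent each, and applying the sum to the product or ratio itself.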

If the estimates $y_1, y_2, y_3, \ldots$ are not independent the concept of covariance
must be introduced. The covariance between two estimates is the mean
product deviation, and is estimated in exactly the same manner as is the
variance of each of the estimates, with the exception that the sum of squares
of the deviations of a single variate is replaced by the sum of products of the
deviations of the two variates. If the covariance between $y_1$ and $y_2$ is denoted
by cov $(y_1 y_2)$ the additional terms

$$+ \, 2 l_1 l_2 \, \text{cov}(y_1 y_2) + 2 l_1 l_3 \, \text{cov}(y_1 y_3) + 2 l_2 l_3 \, \text{cov}(y_2 y_3) + \ldots$$

must be introduced into the formula for V(L). This gives the additional
term $-2 \, \text{cov}(y_1 y_2)$ in the formula for $V(y_1 - y_2)$. The corresponding
additional term in $V(y_1 y_2)$ is $+ 2 y_1 y_2 \, \text{cov}(y_1 y_2)$, and that in $V(y_1/y_2)$ is
$-2 \, \text{cov}(y_1 y_2)/y_1 y_2$ within the bracket.

If $y_1, y_2, y_3, \ldots$ are derived from different variates belonging to the same
sampling units, e.g. measurements of different characters, the variance of any 
linear function L can, if desired, be estimated directly by calculating a value 
L for each sampling unit separately and estimating V (L) from these values 
in the manner appropriate to a single variate. This obviates the calculation 
of the variances and covariances of the individual variates. The same method 
can be followed with products and ratios, subject to the same limitations as 
those given above for formulas 7.5.e and 7.5.f. If the errors of a number 
of functions are required, however, it is best to calculate the variances and 
covariances (Example 7.8.b). 
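
The equivalence of the two routes can be illustrated on a small set of figures. The sketch below uses hypothetical measurements of two characters on five sampling units and takes the linear function L = y1 - y2. It compares the variance per unit of L computed directly with the value built up from the variances and covariance; the helper names are assumptions.

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Mean product deviation with divisor n - 1; cov(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical measurements of two characters on the same five units.
y1 = [4.0, 6.0, 5.0, 9.0, 6.0]
y2 = [2.0, 3.0, 1.0, 4.0, 5.0]
l1, l2 = 1.0, -1.0

# Route 1: calculate L for each unit and treat it as a single variate.
L = [l1 * a + l2 * b for a, b in zip(y1, y2)]
direct = cov(L, L)

# Route 2: variances and covariance of the individual variates.
via_cov = l1 ** 2 * cov(y1, y1) + l2 ** 2 * cov(y2, y2) + 2 * l1 * l2 * cov(y1, y2)

print(direct, via_cov)
```

The two values agree identically, so the choice between the routes is one of computational convenience only.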

The regression and correlation coefficients can be expressed in terms of 
the variances and covariance. We have the relations $b = \text{cov}(xy)/V(x)$,
and $r = \text{cov}(xy)/\sqrt{V(x) \, V(y)}$.

In the more complicated types of sampling, discussed later, the estimation 
of covariance is again exactly parallel to the estimation of the corresponding 
variances, the squares being replaced by products wherever they occur. 



Example 7.5 

Calculate standard errors for the various estimates of the regional and
varietal differences between the yields of potatoes given in Tables 5.23.c,
5.23.d, 5.23.e and 5.24, given that the variance of the yield per acre of any
one variety in any one region is 4.22, and that the standard deviation is
therefore 2.05.

The standard errors of the regional-varietal means of Table 5.23.b are
obtained by dividing the above standard deviation by the square roots of the
numbers of fields. Thus for Majestic in Scotland the standard error is
$2.05/\sqrt{37} = 0.34$. The standard errors are shown in Table 7.5.

TABLE 7.5 POTATO SURVEY : STANDARD ERRORS OF REGIONAL-VARIETAL MEANS

                 Scotland   North   E. Midlands   South   West
Majestic           0.34     0.24       0.20       0.20    0.24
King Edward        0.32     0.55       0.22       0.25    0.31
Great Scot         0.48     0.55        -         0.84    0.48
Arran Banner       0.72     0.33        -         0.68    0.38
Kerr's Pink        0.25     0.34        -          -      0.57

These standard errors enable the differences between the individual means
to be examined more critically. The difference between Scotland and the
Northern region for Arran Banner, for example, is at first sight anomalous,
being - 0.12. The standard error of this difference is $\sqrt{0.72^2 + 0.33^2}$
= 0.79. This difference, therefore, does not conflict very seriously with
the other differences.

On the other hand the difference between this difference and the largest
positive difference, that for King Edward, is + 1.94 - (- 0.12) = + 2.06.
This quantity has a standard error of $\sqrt{0.32^2 + 0.55^2 + 0.72^2 + 0.33^2}$
= 1.02. It might therefore be judged unlikely, on this evidence alone,
that the difference has arisen by chance, since Table A.2 shows that a difference
of 2.0 times its standard error would arise by chance in less than 1 in 20 times.
In statistical terminology the difference is significant at the 1 in 20 level of
significance. This conclusion, however, is subject to the qualification that
we have here picked the two extreme differences out of 10 possible pairs. A
combined test* of all 5 differences shows that they are not exceptionally variable.
A more comprehensive test of the differences of the whole table confirms 
this verdict. It may be noted, however, that Arran Banner is only J as common 
in Scotland as in the Northern region, whereas King Edward is 3 Crimes as 
common in Scotland as in the Northern region. The observed differences 
are therefore in the direction that would be expected if the varieties were 
grown in the regions to which they were most suited. 

The standard errors of the means of Table 5.23.c may be calculated in
a similar manner, that for mean (a) of Scotland for example being
$\tfrac{1}{5}\sqrt{0.34^2 + 0.32^2 + 0.48^2 + 0.72^2 + 0.25^2} = 0.20$. The corresponding
value for the Northern region is 0.19. The standard error of the estimated
difference 8.55 - 7.46 = + 1.09 is therefore $\sqrt{0.20^2 + 0.19^2} = 0.28$.
Similarly the standard error for the corresponding difference of the means (b),
8.98 - 7.37 = + 1.61, is 0.38.
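
The arithmetic for the means (a) can be reproduced as follows. The standard errors are those of Table 7.5 for Scotland and the Northern region; the variable and function names are illustrative assumptions. Mean (a) is treated as the unweighted mean of the five varietal means, which is what the factor of one-fifth in the text implies.

```python
import math

# Standard errors of the regional-varietal means (Table 7.5), in the order
# Majestic, King Edward, Great Scot, Arran Banner, Kerr's Pink.
se_scotland = [0.34, 0.32, 0.48, 0.72, 0.25]
se_north = [0.24, 0.55, 0.55, 0.33, 0.34]

def se_unweighted_mean(ses):
    """S.E. of the simple mean of independent estimates."""
    return math.sqrt(sum(s * s for s in ses)) / len(ses)

se_a_scotland = se_unweighted_mean(se_scotland)
se_a_north = se_unweighted_mean(se_north)
se_difference = math.sqrt(se_a_scotland ** 2 + se_a_north ** 2)

print(round(se_a_scotland, 2), round(se_a_north, 2), round(se_difference, 2))
```

The rounded results agree with the values 0.20, 0.19 and 0.28 quoted above.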

Table 5.23.d provides an example of a weighted mean with the weights
so chosen that the most accurate combined estimate is obtained. Formula 7.5.f
is therefore appropriate, and A represents the variance of a single field, i.e.
A = 4.22. Hence the variance of the weighted mean difference = 4.22/74
= 0.0570, and the standard error is therefore 0.24.

The relative efficiency of the above estimates of the differences may be
assessed from the ratio of the reciprocals of the variances (Section 8.1).
Assigning a value of 100 to the weighted mean, the relative efficiencies of
means (a) and (b) are 73.5 and 39.9.†

Finally we may evaluate the standard errors of Table 5.23.e. These
cannot be evaluated exactly, as the pooling of regions is based on the assumption
that there are no differences of any importance between these regions. In so
far as this is not the case additional errors will be introduced, and the standard
errors calculated on the assumption that there are no differences will therefore
be underestimates of the true errors.

* The weighted sum of squares of deviations gives $\chi^2 = 5.74$ with 4 degrees of
freedom.

† These values do not represent the true efficiencies, which are obtained by assigning
the value of 100 to the most efficient possible method, here that of Table 5.24.

The standard errors of the means of the pooled regions can be calculated
from the numbers of fields on which each mean is based. Thus that for
Majestic is $\sqrt{4.22/356} = 0.11$. The standard errors of the Scottish
means have already been given in Table 7.5. The standard error of the
weighted mean for Majestic is therefore given by

$$\sqrt{\frac{4^2 \times 0.11^2 + 1^2 \times 0.34^2}{5^2}} = 0.11$$

Similarly the standard errors for the other four varieties are found to be
0.13, 0.29, 0.24 and 0.24.

The standard errors of the estimates obtained in Table 5.24 can only 
be calculated exactly by inversion of the matrix of the simultaneous linear 
equations giving the least-squares solution. This requires a good deal of 
arithmetical work. The method is explained, for example, in Statistical 
Methods for Research Workers, Section 29. 

In material of this type, however, there will rarely be any need to determine 
the standard errors exactly. An upper limit to the standard error of any 
particular difference can be obtained by calculating the standard error of the 
estimate given by Method (3) of Section 5.23. A lower limit can be obtained
by calculating what the standard error would be if there were no cross
classification, and if the relevant variance per unit were that within sub-classes.
The value of this latter standard error for the difference between Scotland and 
the Northern region, for example, is 



The value of the standard error for Method (3) has already been found to be
0.24. In this case close limits are set to the true standard error.

7.6 Stratified random sample with possibly unequal variances within
strata

Since in a stratified sample differences between the sampling units in the 
different strata are eliminated from the sampling error, in estimating this error 
we require not the total variance of the sampling units over the whole population, 
but the variances of the sampling units within the different strata. 

A simple example will illustrate the difference. Suppose we have a large
population of which 25 per cent. of the units have the value 8, 50 per cent.
have the value 10, and 25 per cent. have the value 12. The mean of the
population is 10, and 50 per cent. of the values have a deviation of 0 from the
mean, the remaining 50 per cent. having a deviation of ± 2. The mean
square deviation or total variance is therefore $0.5 \times 0^2 + 0.5 \times 2^2 = 2$.
Suppose now the population is divided into two strata, the first containing all
the units with value 8 and one-half the units with value 10, the second one-half
the units with value 10 and all the units with value 12. The mean of
the first stratum is 9, and all units in it have a deviation of ± 1. The mean
square deviation or variance within the first stratum is therefore 1. The same
holds for the second stratum.

In this example the strata are of the same size and the within-strata variances
are equal. Examples can easily be constructed in which this is not the case,
but even if the variances are unequal an average within-strata variance can be 
calculated, and in a large population this will always be less than the total 
variance if there are differences between the strata means. 
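
The figures of this illustration are easily verified. In the sketch below each population or stratum is described by pairs of (value, proportion of units); the function name is an assumption, and the population is treated as large, so that the divisor is the number of units itself.

```python
def variance_from_proportions(dist):
    """Mean square deviation of a large population given (value, proportion) pairs."""
    mean = sum(p * v for v, p in dist)
    return sum(p * (v - mean) ** 2 for v, p in dist)

whole_population = [(8, 0.25), (10, 0.50), (12, 0.25)]
stratum_1 = [(8, 0.5), (10, 0.5)]    # all the 8's and half the 10's
stratum_2 = [(10, 0.5), (12, 0.5)]   # half the 10's and all the 12's

total_variance = variance_from_proportions(whole_population)
within_1 = variance_from_proportions(stratum_1)
within_2 = variance_from_proportions(stratum_2)

print(total_variance, within_1, within_2)
```

The within-strata variance of 1 is half the total variance of 2; the remainder is the variance between the strata means 9 and 11.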

If the sample numbers in all strata are sufficiently large for the within-strata 
variances per sampling unit to be separately estimated, the sampling variances 
of the means or totals of the individual strata can be estimated separately, and 
the sampling variance of the population mean or total, which is a linear function 
of these means or totals, can be obtained by the use of the formulae of 
Section 7.5. 

This method is valid even if there is inequality in the within-stratum
variance per sampling unit from stratum to stratum, and is applicable to all
types of stratification, including stratification with a variable sampling fraction
and stratification after selection.

In general it is best to build up the variance of the population estimate
under consideration by calculating the variances of the component parts, and
adding these variances, or the correct multiples of them, the same steps being
followed as in the calculation of the estimate itself. We will therefore not
give formulae for the variances of all the different estimates set out in Sections
6.4 and 6.5, but will illustrate the derivation of such formulae by obtaining
that for $V(\bar{y})$ in the case of a variable sampling fraction.

We have $\bar{y} = \{\Sigma \, g_i S_i(y)\}/N$. If $\sigma_i^2$ is the variance within the ith stratum,
$V\{S_i(y)\} = n_i \sigma_i^2 (1 - f_i)$, and hence by formula 7.5.a

$$V(\bar{y}) = \{\Sigma \, g_i^2 n_i \sigma_i^2 (1 - f_i)\}/N^2 \qquad (7.6.a)$$

For the estimate of $V(\bar{y})$ the $\sigma_i^2$ will be replaced by their estimates $s_i^2$.

If all the sampling fractions are equal we have, since N = gn,

$$V(\bar{y}) = (1 - f) \, \Sigma (n_i s_i^2)/n^2 \qquad (7.6.b)$$

The examples which follow will illustrate the details of the methods to be 
followed in the cases that are ordinarily met with in practice. 

If we require the sampling errors of estimates applicable to domains of
study which cut across the strata the situation is more complicated. If a dash
is used to denote the domain in question, and if $s_i'^2$ is the estimated variance
per unit between the units of this domain in stratum i, and the proportion
of the selected units of stratum i not in the domain is $q_i'$, so that $q_i' = (n_i - n_i')/n_i$,
estimates of the variances of the total and mean of the domain are given by

$$V(Y') = \Sigma \, g_i^2 n_i (1 - f_i) \{n_i' q_i' \bar{y}_i'^2 + (n_i' - 1) s_i'^2\}/(n_i - 1) \qquad (7.6.c)$$



Consequently, in the case of a stratified sample with uniform sampling
fraction, an approximate estimate of the sampling error of a domain mean
will be obtained by treating the sample as if it were stratified for the domain 
in question, and not stratified for the strata which cut across the domain. 
(See Example 7.7.b.) 

In the case of a variable sampling fraction the above formulae must be 
used. In addition the variance of the difference of the means of two domains 
will be somewhat increased by covariance. It may be noted that, although 
the variances of estimates for such domains may be considerably increased by 
lack of control by stratification, a variable sampling fraction will still be 
advantageous in the appropriate circumstances. The optimal sampling fractions 
will be dependent on the estimates required, being approximately proportional 
to the square roots of $1/(n_i - 1)$ times the quantities in the curly brackets.*

Example 7.6.a

Estimate the sampling errors of the estimates of Example 6.5. 

The computations for acreages are set out in Table 7.6.a. The various
steps are as follows.

The sums of squares of deviations from each stratum mean, $S_i(y - \bar{y}_i)^2$,
are first calculated from the sample values given in Table 6.5.a, using the
method of Example 7.1.a. The estimated within-strata variances $s_i^2$ are
then calculated by dividing by $n_i - 1$. Multiplying these by $(1 - f)/n_i$ gives
the variances of the strata means $V(\bar{y}_i)$, and taking the square roots gives
the sampling standard errors S.E. $(\bar{y}_i)$ of these means. The means themselves
are tabulated in Table 6.5.b.

TABLE 7.6.a ESTIMATION OF SAMPLING ERRORS OF THE WHEAT ACREAGES OF
EXAMPLE 6.5

Size-
group    n_i   n_i - 1   S_i(y - ȳ_i)²     s_i²     V(ȳ_i)   S.E.(ȳ_i)   V{S_i(y)}       V(Y_i')
  1-      22      21
  6-      26      25            47          1.9      0.069       0.26           47        19,000
 21-      18      17           191         11.2      0.593       0.77          192        76,000
 51-      26      25         4,051        162.1      5.92        2.43        4,004     1,595,000
151-      20      19        13,899        731.5     34.75        5.89       13,898     5,560,000
301-      13      12        23,370       1947.5    142.3      ± 11.93       24,052    10,069,000

         125     119                                                        42,193    17,319,000

* The sampling errors of contrasts between domains are more fully discussed in
Sections 9.2-9.4.

To obtain the sampling errors of the population estimates we must bear
in mind the method by which these estimates were arrived at. The estimate Y
of the population total was obtained by multiplying the sample total by the
raising factor. We therefore require the variance of the sample total 2011.
This will be equal to the sum of the variances of the strata totals. These
variances $V\{S_i(y)\}$ are obtained by multiplying $s_i^2$ by $n_i(1 - f)$. The
square root of the sum of the variances is then taken and multiplied by 20.
Thus

$$\text{S.E.}(Y) = 20 \times \sqrt{42{,}193} = 20 \times 205.41 = 4110$$

Similarly

$$\text{S.E.}(\bar{y}) = 205.41/125 = 1.64$$

In the case of Y' each $V(\bar{y}_i)$ is multiplied by the square of the number of
farms in the size-group, given in Table 6.5.b. Thus

$$142.3 \times 266^2 = 10{,}069{,}000$$

Taking the square root of the sum,

$$\text{S.E.}(Y') = \sqrt{17{,}319{,}000} = 4160$$

and hence

$$\text{S.E.}(\bar{y}') = 4160/2496 = 1.67$$

It will be noted that although Y' is slightly more accurate than Y the 
standard error given by the above calculation is slightly greater. There are two 
reasons for this. In the first place in calculating S.E. (Y) we have neglected 
the errors introduced by the use of a working sampling fraction. The average 
errors from this cause are equivalent to errors introduced by rounding off 
the numbers in the different strata of the sample to whole numbers. In the 
second place, use of the exact sampling fractions will result in slightly different 
contributions to the error variance from the different strata, which may result 
in raising or lowering the estimate of the standard error. 
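
The build-up of S.E. (Y) from the strata may be sketched as follows, using the sample numbers and sums of squares of Table 7.6.a. The uniform sampling fraction of 1/20 is an assumption inferred from the raising factor of 20; the smallest size-group, which has no recorded sum of squares, is omitted as contributing nothing. The variable names are illustrative.

```python
import math

raising_factor = 20
f = 1 / raising_factor                         # assumed uniform sampling fraction

# Size-groups 6-, 21-, 51-, 151-, 301- of Table 7.6.a.
n = [26, 18, 26, 20, 13]                       # sample numbers
sum_sq = [47, 191, 4051, 13899, 23370]         # sums of squares within strata

s2 = [ss / (ni - 1) for ss, ni in zip(sum_sq, n)]            # within-strata variances
var_means = [v * (1 - f) / ni for v, ni in zip(s2, n)]       # V of each stratum mean
var_totals = [v * ni * (1 - f) for v, ni in zip(s2, n)]      # V of each stratum total

se_sample_total = math.sqrt(sum(var_totals))
se_Y = raising_factor * se_sample_total        # S.E. of the estimated total
se_mean = se_sample_total / 125                # 125 = whole sample, all six groups

print(round(se_Y), round(se_mean, 2))
```

Working from the unrounded variances gives a value for S.E. (Y) close to 4108, which agrees with the 4110 of the text to the accuracy quoted there.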

The computations for number of farms growing wheat follow the same
pattern, with the exception that the variance of each size-group total $V(u_i)$
is estimated from the proportion of farms growing wheat in that size-group.

TABLE 7.6.b ESTIMATION OF THE SAMPLING ERROR OF THE NUMBER OF FARMS
GROWING WHEAT FROM EXAMPLE 6.5

Size-group
(acres)        n_i    u_i      p_i       V(u_i)
  1-5           22
  6-20          26      1    0.03846     0.913
 21-50          18      5    0.27778     3.431
 51-150         26     21    0.80769     3.837
151-300         20     16    0.80000     3.040
301-            13     11    0.84615     1.608

                                        12.829

The calculations for S.E. (U) are given in Table 7.6.b. For the largest size-group,
for example,

$$V(u_i) = 13 \times 0.84615 \times 0.15385 \times 19/20 = 1.608$$

We then have

$$\text{S.E.}(U) = 20 \times \sqrt{12.83} = 71.64$$

If the sampling errors of the different strata means are not required, the 
above calculations can be simplified slightly by omitting the factor $(1 - f)$
till the final stage of the computation. 
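
The corresponding computation for the number of farms growing wheat is sketched below from the counts of Table 7.6.b. Each size-group variance is taken as n p (1 - p)(1 - f), the form used in the worked line for the largest size-group; the sampling fraction of 1/20 and the variable names are assumptions.

```python
import math

raising_factor = 20
f = 1 / raising_factor                 # assumed uniform sampling fraction

# Size-groups 6-20, 21-50, 51-150, 151-300, 301- of Table 7.6.b.
n = [26, 18, 26, 20, 13]               # farms in the sample
u = [1, 5, 21, 16, 11]                 # of which growing wheat

var_u = []
for ni, ui in zip(n, u):
    p = ui / ni                        # estimated proportion growing wheat
    var_u.append(ni * p * (1 - p) * (1 - f))

se_U = raising_factor * math.sqrt(sum(var_u))

print([round(v, 3) for v in var_u], round(sum(var_u), 2), round(se_U, 1))
```

The factor (1 - f) is common to every stratum here, so it could equally be applied once to the sum, which is the simplification mentioned above.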

Example 7.6.b 

Estimate the sampling errors of the estimate of wheat acreage and number 
of farms growing wheat obtained by stratification after selection of the random 
sample of Example 6.6. 

The computations follow the same lines as those for S.E. (Y') in Example
7.6.a. The values obtained are :

S.E. (Y') = 4320 acres
S.E. (U') = 75.2 farms

The value for S.E. (Y') is slightly greater than that for the similar estimate
of Example 7.6.a. The difference, however, is not a precise measure of the
relative accuracy of stratification before and after selection. In the first place, 
since different samples are involved, there are differences in the estimates of 
the within-strata variances. In particular, the estimate of the variance for 
size-group 301- is much greater in the random sample because of the one 
high value 265. These differences are merely errors of estimation and are 
not a reflection of the relative accuracy of the two samples. On the other hand, 
the two largest size-groups, which have the highest variances, happen by 
chance to have more than the proportionate number of farms in the random 
sample, and this particular random sample will therefore tend to give a more 
accurate estimate with stratification than will a stratified sample. On the 
average, however, random samples stratified after selection will give slightly 
less accurate values than stratified samples. 

7.7 Pooled estimate of error : the analysis of variance 

In a stratified sample with equal sampling fractions all the $f_i$ are equal.
If in addition the within-strata variances $\sigma_i^2$ are equal, by putting $S(n_i) = n$
we obtain the simplified formula

$$V(\bar{y}) = (1 - f) \, \sigma_1^2/n$$

This is the same as the formula for a random sample, with the exception that
$\sigma$ is replaced by $\sigma_1$, the common within-strata standard deviation.

The most accurate estimate of $\sigma_1^2$ will be obtained if the estimates of $\sigma_i^2$
from the various strata are weighted according to the numbers of degrees of
freedom on which they are based. This is equivalent to adding the sums of
squares of deviations from the strata means and dividing by the sum of the
associated degrees of freedom, i.e. by $n - t$.

It is worth noting here an identity which relates the above sum of squares
to the sum of squares of deviations from the general mean. This is

$$\Sigma \, S_i(y - \bar{y}_i)^2 + S \, n_i(\bar{y}_i - \bar{y})^2 = S(y - \bar{y})^2$$

This is easily verified if we recognize that

$$S \, n_i(\bar{y}_i - \bar{y})^2 = S\{\bar{y}_i S_i(y)\} - \bar{y} S(y)$$

The first term on the right-hand side is the sum of the products of the means
and totals of the separate strata, i.e. the " corrections for the means " for the
separate strata, and the second term is the " correction for the general mean."
The arithmetical computations can be conveniently arranged in the form
of what is known as the analysis of variance. This is shown schematically in
Table 7.7.a.

TABLE 7.7.a ANALYSIS OF VARIANCE BETWEEN AND WITHIN STRATA

                      Degrees of    Sum of                       Mean
                      freedom       squares                      square
Between strata          t - 1       S ȳ_i S_i(y) - ȳ S(y)        A
Within strata           n - t       Σ S_i(y - ȳ_i)²

Whole sample            n - 1       S(y²) - ȳ S(y)

The most convenient form of computation, at least when the numbers
in the different strata are small, is to calculate the sums of squares for the
whole sample and between strata, and obtain the sum of squares within strata
by subtraction. The mean squares are then obtained by division by the degrees
of freedom. Only the mean square within strata, $s_1^2$, is required for the present
purpose. The mean square for the whole sample approximates closely to an
estimate $s^2$ of the variance per unit that would result from random sampling
of the whole population. The interpretation of the mean square between
strata, A, is discussed in Section 8.10.

A further simplification is possible when each of the strata contains only
two sampling units. In this case the sum of squares within strata can be
calculated directly from the differences of the y's of the pairs of units within
each stratum. If these differences are denoted by d, the sum of squares will
be $\tfrac{1}{2} S(d^2)$, and consequently, since there are t differences each contributing
one degree of freedom,

$$s_1^2 = S(d^2)/2t$$
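
The short cut for strata of two units can be checked against the general computation. In the sketch below the three pairs of values are hypothetical and the function names are assumptions; the first function pools the sums of squares about the strata means in the ordinary way, and the second works from the differences within pairs.

```python
def within_strata_mean_square(strata):
    """Pooled sum of squares about strata means, divided by pooled degrees of freedom."""
    sum_sq, df = 0.0, 0
    for units in strata:
        m = sum(units) / len(units)
        sum_sq += sum((y - m) ** 2 for y in units)
        df += len(units) - 1
    return sum_sq / df

def mean_square_from_pairs(pairs):
    """Short cut for strata of two units: half the sum of squared differences, over t."""
    t = len(pairs)
    return sum((a - b) ** 2 for a, b in pairs) / (2 * t)

# Hypothetical sample with two units from each of three strata.
pairs = [(3.0, 5.0), (10.0, 6.0), (7.0, 7.0)]

print(within_strata_mean_square(pairs), mean_square_from_pairs(pairs))
```

Both functions return the same value, since the sum of squares about the mean of a pair is half the square of their difference.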



The analysis of variance has many applications, not only in sampling but 
in other fields of statistics. As its name implies, it provides a way of determining
the different components of variance to which a given type of material is
subject. As such it is of particular value in investigations of the efficiency of 
different types of sampling. Its uses in this connection will be explained in 
Chapter 8. 

The above discussion has been based on the assumption that the within- 
strata variances are equal. In practice this is not likely to be exactly true, 
though the data at our disposal may be insufficient to determine the true 
variance law with any accuracy. It is therefore important to ascertain what 
is the position if a pooled estimate of variance is used when the variances are 
in fact unequal. 

If all the sampling fractions are equal we have, from equation 7.6.b,

$$V(\bar{y}) = \frac{1 - f}{n} \cdot \frac{\Sigma (n_i s_i^2)}{n}$$

The second factor is a weighted mean of the $s_i^2$, with weights equal to $n_i$. In
the pooled estimate of error described above $s_1^2$ is a weighted mean of the
$s_i^2$ with weights proportional to $n_i - 1$. Unless the numbers in the different
strata are very small, and associated in magnitude with $\sigma_i^2$, there will be little
difference between the two estimates. Use of the pooled estimate of error
in the estimation of the error of the population mean and total will not,
therefore, introduce any serious disturbance when the sampling fractions are
equal, even when the within-strata variances are very unequal. On the other
hand, the use of a pooled estimate to determine the errors applicable to the
mean or total of part of the population, e.g. a single stratum mean, may be very
misleading, and it is therefore best, when there are marked differences in the
within-strata variances, and the numbers in the different strata are not too
small, to keep the error estimates separate, as has been done in Example 7.6.a.

There is, of course, nothing sacrosanct in weighting by the degrees of 
freedom ; these weights merely give the most accurate estimate when the 
variances are equal, and enable the analysis of variance technique to be used. 
If the $n_i$ are too small for separate estimates of the within-strata variances
to be of value, we can still use a pooled estimate, weighting by $n_i$ if this appears
advisable. 

The situation is completely different with a variable sampling fraction.
In this case equation 7.6.a shows that weights proportional to $g_i^2 n_i (1 - f_i)$
are required. The pooled estimate of variance may therefore be decidedly 
misleading, even with quite small differences in the within-strata variances. 
Consequently in this case separate estimates with proper weighting should 
always be used. 

Example 7.7.a

Estimate the sampling errors of the estimate of the total wheat acreage
and number of farms growing wheat from the stratified systematic sample
with a variable sampling fraction of Example 6.7.

Since the systematic selection was from a list arranged by districts the
sample will substantially be stratified by districts as well as size-groups. The
numbers in the 6-20 and 21-50 size-groups are too small for the district
stratification to be effective, but for the larger size-groups the within-districts
variance is required, instead of the overall variance within the size-group.

The available number of degrees of freedom in each district and size-group
is small, and inspection of the values of Table 6.7.a shows that there
are no marked differences in variability in the different districts. We may 
therefore appropriately use a pooled estimate of variance for each size-group. 

The district totals and means for the largest size-group are shown in
Table 7.7.b.

TABLE 7.7.b DISTRICT TOTALS AND MEANS FOR SIZE-GROUP 501-

District:          1        2        3          4       5      6      7        All
No. of farms       1        4        2          9       0      1      0         17
Total            114      487      315       1937             72              2925
Mean             114   121.75    157.5   215.2222             72          172.0588

The analysis of variance is given in Table 7.7.c. The sum of squares
between districts is 114 × 114 + 487 × 121.75 + ... - 2925 × 172.0588
= 40,698. The total sum of squares is 114² + 119² + 107² + ... - 2925
× 172.0588 = 72,067. Subtraction gives the within-district sum of squares,
and division by the degrees of freedom the mean squares.

TABLE 7.7.c ANALYSIS OF VARIANCE BETWEEN AND WITHIN DISTRICTS OF THE
WHEAT ACREAGES OF SIZE-GROUP 501- (DATA OF TABLE 6.7.a)

                      Degrees of    Sum of     Mean
                      freedom       squares    square
Between districts          4        40,698     10,174
Within districts          12        31,369      2,614

Whole size-group          16        72,067      4,504

The within-district mean square is substantially less than the overall mean 
square, indicating the greater similarity of farms within a district and consequent 
gain in accuracy by stratification. 
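
The analysis of variance for this size-group can be reproduced from the district totals and numbers of farms alone, provided the total sum of squares is taken as given. In the sketch below the figure 72,067 is quoted from the text, since the individual farm values of Table 6.7.a are not reproduced here; districts with no farms in the sample are omitted, and the variable names are assumptions.

```python
# Districts of Table 7.7.b that contain sample farms.
farms = [1, 4, 2, 9, 1]
totals = [114, 487, 315, 1937, 72]
total_sum_sq = 72067                    # quoted from the text, not recomputed

n = sum(farms)                          # 17 farms
t = len(farms)                          # 5 districts
correction = sum(totals) ** 2 / n       # "correction for the general mean"

# Sum over districts of (district total x district mean), less the correction.
between_sum_sq = sum(T * T / k for T, k in zip(totals, farms)) - correction
within_sum_sq = total_sum_sq - between_sum_sq

ms_between = between_sum_sq / (t - 1)
ms_within = within_sum_sq / (n - t)
ms_total = total_sum_sq / (n - 1)

print(round(between_sum_sq), round(within_sum_sq))
print(round(ms_between), round(ms_within), round(ms_total))
```

The within-district mean square of about 2,614 is the pooled estimate of variance carried forward for this size-group.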

The size-groups 51-, 151-, and 301- can be analysed in the same manner. 
There is some difference in mean squares for size-group 301-, but little 
difference for the other two size-groups. For size-groups 6- and 21- the 
overall variability within size-groups can be taken. 

The remainder of the computations are set out in Table 7.7.d. They
follow the same lines as Example 7.6.a, with the exception that the factors
$(1 - f_i)$ are different, and that the variance of each group total must be
multiplied by the square of the raising factor for that group, and the resultant
variances added, in accordance with formula 7.5.a. The estimated standard
error of the total acreage is thus $\sqrt{6{,}486{,}000} = 2550$.

TABLE 7.7.d ESTIMATION OF SAMPLING ERRORS OF THE ESTIMATES
OF EXAMPLE 6.7

Size-group
(acres)        n_i     s_i²     V{S_i(y)}      g_i²       V(Y_i)
  1-5
  6-20           3       0                    40,000
 21-50           6      53.5        320        3,600    1,152,000
 51-150         26     159.2      3,930          400    1,572,000
151-300         40     564.2     20,310          100    2,031,000
301-500         43     1,703     58,580           25    1,464,000
501-            17     2,614     29,630            9      267,000

               135                                      6,486,000

It will be noted that the acreage of wheat in the smallest size-group has 
been assumed to be zero, and that the estimated zero error variance of the 
second size-group is based on only two degrees of freedom, and is therefore 
very inaccurately determined. It is clear, however, from the nature of the 
material and the trend of the variances in the larger size-groups that this variance 
must be small. 
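
The final stage of this computation, in which each group variance is multiplied by the square of its own raising factor before addition, may be sketched as follows. The figures are the rounded entries of Table 7.7.d for the five size-groups with non-zero variance; the variable names are assumptions.

```python
import math

# Size-groups 21-50, 51-150, 151-300, 301-500, 501- of Table 7.7.d.
var_group_totals = [320, 3930, 20310, 58580, 29630]   # V of each sample group total
raising_factor_sq = [3600, 400, 100, 25, 9]           # squares of the raising factors

var_contributions = [v * g2 for v, g2 in zip(var_group_totals, raising_factor_sq)]
var_Y = sum(var_contributions)
se_Y = math.sqrt(var_Y)

print(var_contributions, var_Y, round(se_Y))
```

It is noticeable that the contributions are of the same order for all five groups, although the variances of the sample totals differ by a factor of nearly two hundred; the heavier sampling of the large farms has roughly equalized them.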

In the computation of the standard error of the number of farms growing
wheat, allowance should also strictly be made for the stratification by districts.
If the number of farms in each size-group district sub-class were large this
could be done by calculating the variance of each size-group total of farms
growing wheat by the method of Example 7.6.a. The numbers in many of
the sub-classes are so small, however, that the approximation resulting from
using the estimated proportions p to calculate the variances will be unsatisfactory.
In this case it will be sufficient to ignore the district stratification, calculating
the variance of each size-group total of farms growing wheat from the proportion
in that size-group, and then proceeding in the same manner as for wheat
acreage. The resultant standard error will be found to be 88.9.

Example 7.7.b

Estimate the sampling standard errors of the regional and varietal means
of the yields of potatoes given in Table 5.23.a, and compare them with the
standard errors already obtained in Example 7.5.

As mentioned in Section 5.23, the sample can be regarded as stratified
by regions (but not by varieties). The regional standard errors are therefore
derived from the analysis of variance within and between regions. This is
given by lines (1), (4) and (5) of Table 7.7.e. The required standard errors
are therefore $\sqrt{5.25/174} = 0.17$, etc.



TABLE 7.7.e POTATO SURVEY : ANALYSIS OF VARIANCE OF YIELDS PER ACRE

                               Degrees of    Sum of     Mean
                               freedom       squares    square
Between regions (1)                 4         173.7      43.42
  Between varieties (2)            16         987.7      61.73
  Within varieties (3)            880        3713.6       4.22
Total (4)                         896        4701.3       5.25

Total (5)                         900        4875.0       5.42

Between varieties (6)               4         887.3     221.82
  Between regions (7)              16         274.1      17.13
  Within regions (8)              880        3713.6       4.22
Total (9)                         896        3987.7       4.45

Total (10)                        900        4875.0       5.42



The mean square within regions, 5.25, is 1.24 times the mean square
within regions and varieties, 4.22, already given in Example 7.5. This latter
mean square is obtained from an analysis of variance within and between 
the regional-varietal groups, lines (2) and (3). This would have been the 
appropriate mean square for estimating the errors of the regional means if 
the sample had been stratified by regions and varieties. 

The exact standard errors of the varietal means cannot be obtained by any 
simple process. If the sample were fully random, and not stratified by regions, 
the correct estimate would be that given by the within-varieties component 
of variance, i.e. by treating the sample as if it were stratified by varieties. 
Stratification by regions will reduce the sampling error of the varietal means 
slightly, but not to any great extent. Consequently the estimate obtained by 
stratifying by varieties and not by regions will be somewhat of an overestimate 
of the true standard error. 

The analysis of variance within and between varieties (ignoring regions) 
is given by lines (6), (9) and (10) of Table 7.7.e. Approximations to the 
varietal standard errors are therefore given by √(4.45/393) = 0.11, etc.
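The arithmetic behind these standard errors can be checked with a short sketch. It takes the sums of squares and degrees of freedom from lines (3), (4) and (9) of Table 7.7.e, and the numbers of fields (174 and 393) from the two square roots quoted in the text. The function names are illustrative only.

```python
from math import sqrt

# Sums of squares and degrees of freedom taken from Table 7.7.e.
SS_WITHIN_REGIONS, DF_WITHIN_REGIONS = 4701.3, 896        # line (4)
SS_WITHIN_VARIETIES, DF_WITHIN_VARIETIES = 3987.7, 896    # line (9)
SS_WITHIN_SUBCLASSES, DF_WITHIN_SUBCLASSES = 3713.6, 880  # lines (3) and (8)


def mean_square(sum_of_squares, degrees_of_freedom):
    """Mean square = sum of squares / degrees of freedom."""
    return sum_of_squares / degrees_of_freedom


def se_of_mean(ms, n_fields):
    """Standard error of a mean of n_fields values with variance ms per field."""
    return sqrt(ms / n_fields)


ms_regions = mean_square(SS_WITHIN_REGIONS, DF_WITHIN_REGIONS)
ms_varieties = mean_square(SS_WITHIN_VARIETIES, DF_WITHIN_VARIETIES)
ms_subclasses = mean_square(SS_WITHIN_SUBCLASSES, DF_WITHIN_SUBCLASSES)

# 174 and 393 are the numbers of fields used in the text for the first
# regional mean and the first varietal mean respectively.
se_first_region = se_of_mean(ms_regions, 174)
se_first_variety = se_of_mean(ms_varieties, 393)

print(round(ms_regions, 2), round(se_first_region, 2))
print(round(ms_varieties, 2), round(se_first_variety, 2))
print(round(ms_regions / ms_subclasses, 2))
```

The last line printed is the ratio of the within-regions mean square to the within-sub-classes mean square, which measures what stratification by varieties as well as regions would have gained.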

It should be noted that although the sample is stratified by regions the 
component of variance due to regions must not be eliminated when calculating 
the standard errors of the varietal means. This must only be done if the sample 
is stratified by both varieties and regions. The reason is as follows. If the 
sample is stratified by both varieties and regions the proportions of fields from 
each region in each varietal mean will be exactly equal to the proportions for 
that variety in the country. Hence only variation between fields of the same
variety within each region contributes to the error. In the present case, 
however, the proportions of fields from each region in each varietal mean do 
not correspond exactly to the proportions in the country. The deviations are 
in fact only slightly less than would be obtained in a random sample. 

Those familiar with the use of the analysis of variance in replicated 
experiments may wonder why two separate analyses are required, instead of 
the single analysis analogous to the partition of the degrees of freedom of a 
complete 5 × 5 table into

    Regions                  4
    Varieties                4
    Regions × varieties     16

The reason is that regions and varieties are not orthogonal, owing to the differing
numbers of fields in the different sub-classes. The analysis of variance of
non-orthogonal material is inherently more complicated, and in particular the
interaction component, regions × varieties, can only be obtained by rather
elaborate calculation (Yates, 1934, A). It is not given by the subtraction of 
the sums of squares for regions (1) and varieties (6) from the sum of squares 
for all regional-varietal sub-classes, (1) and (2), or (6) and (7). 

All the components of variance due to the regional-varietal classification 
must be eliminated when calculating the sampling standard errors of varietal 
differences freed from regional effects, or of regional differences freed from 
varietal effects, since we are then concerned only with the component of 
variance within sub-classes. These standard errors have already been discussed 
in Example 7.5. The within-sub-classes component can be obtained by splitting 
the sum of squares within each region into between and within varieties and 
pooling the components so obtained, as in lines (2) and (3) of Table 7.7.e ; 
by doing the same for regions within varieties, as in lines (7) and (8) ; or by 
making a direct analysis between and within regional-varietal sub-classes. 
All three processes are equivalent arithmetically, and give the same sum of 
squares (880 degrees of freedom) within sub-classes. 

The reader should calculate for himself the various sums of squares (other 
than the total sum of squares) given in Table 7.7.e. These can be obtained 
from the data of Table 5.23.b. For this purpose the means require to be 
recalculated to a greater number of decimal places than those given in 
Table 5.23.b.* 

It will now be seen that three separate variances are relevant to the
calculation of the standard errors appropriate to the various estimates we have
obtained from the data of Table 5.23.b. There is also a fundamental difference
in the nature of the errors. Those appropriate to the regional means and
varietal means are genuine sampling errors. They answer the question :
given the existing distribution of varieties in the different regions, by how
much may the sample regional (or varietal) means be expected to deviate from
the corresponding means over all fields ? If the sampling fraction were large
the correction for finite sampling would require to be made.

* There are one or two minor last-place discrepancies between the means and totals
given in Table 5.23.b. These are due to reconstruction of the data from a table in
a report.

In evaluating the errors of the estimates of the regional and varietal 
differences freed from the effects of the other factor, we are concerned with the 
more general question : how far are the estimated differences likely to be in 
error due to chance variations between the fields on which the varieties are 
grown ? This is not a sampling error. It would still exist if the data collected 
represented all the potato fields in the country. The correction for finite 
sampling should therefore not be applied. 

It should be noted also that this latter estimate of error is based on the 
assumption that the distribution of the varieties over the different fields within 
a region is random, and that conditions of growth, etc., are also randomly 
distributed. This may be far from the truth. If variety P is regarded, 
rightly or wrongly, as particularly suitable for poor soils, it will tend to be grown 
on poor soils, and its yield will be less for this reason. A new variety will tend 
to be grown by the more progressive farmers, and may in consequence give 
higher yields, even though no better than the older varieties. Consequently 
the estimates of error merely provide lower limits to the real errors ; in other 
words they represent the errors attributable to the residual random component 
of variance only. Consequently, as already emphasized in Section 5.23, all 
conclusions must be tentative. Only experiments can give definite answers. 

7.8 Ratio method : random sample 

In order to calculate the variance of r the correlation between the values 
of x and y for the same sampling unit must be taken into account. From 
formula 7.5.k, with the additional covariance term and allowance for finite 
population, we obtain for a random sample 



This is an approximate formula, but is accurate enough for practical purposes 
in the cases met with in sampling. 

The formula can be put in an alternative form, which somewhat simplifies 
the approach to more complicated cases such as stratified samples. If we 
denote by Q the sum of the squares of the deviations of the y's from the values
given by the ratio line (OMD of Fig. 6.8), we have

\[ Q = S(y - rx)^2 \]
\[ \;\; = S(y^2) - 2r\,S(xy) + r^2\,S(x^2) \]
\[ \;\; = S(y - \bar{y})^2 - 2r\,S(x - \bar{x})(y - \bar{y}) + r^2\,S(x - \bar{x})^2 \]

the last two expressions being those which are suitable for computation.
If we now take s_Q² to represent the estimated mean square deviation from
the true ratio line, we have

\[ s_Q^2 = Q/(n - 1) \]

We then find 



The first of the above formulas is equivalent to formula 7.5.g. 

The analogy with the case of a random sample without supplementary 
information can now be seen. Apart from a factor equal to the squared ratio of the population and sample means of x, which will be
approximately unity, the only difference is that the sum of the squares of the 
deviations of y from the mean y of the sample is replaced by the sum of the 
squares of the deviations of y from the ratio line of the sample. 

The variance of a standardized estimate y will be obtained by replacing 
x by x in the above formula. 

In the case of two-phase sampling the first-phase estimate x̄₁ of x will
be subject to sampling errors. If the sampling is random for both phases the
variance of y will be



where the suffixes refer to the phases, s_Q² is calculated as above from the
second-phase units, and s² is the total variance of y, also calculated from the
second-phase units.

The first term of the above formula will be recognized as the sampling 
variance of y due to the second-phase sampling of the first-phase sample 
(regarded as without error), while the second term is the first-phase sampling 
variance of y, i.e. the variance which would be obtained if y were determined 
for all units of the first-phase sample. This subdivision of the variance provides 
a general method of obtaining the errors of two-phase sampling. In certain 
types of sampling the circumstance that values of y are not available for all 
the first-phase units introduces complications into the estimation of the second 
component of variance which will be dealt with in Section 8.7. 

It frequently happens that the available supplementary information is out 
of date or otherwise subject to error. If, however, values of x are known for 
all units of the population these values can be used in the calculation of r from
the selected units. If this is done bias will be avoided, and the effect of these
errors on the final estimates will be correctly assessed, provided the original
frame is complete. If the frame is not complete the sampling divides into two
parts : that covering units included in the original frame, for which the ratio
or regression method of estimation can be used ; and that covering units not
so included, for which the appropriate method of estimation without
supplementary information will be required. 

It may also be noted that if the variance V_x(r) of r for fixed x is constant
over the whole range of values of x, and if r itself exhibits no trend over this 
range, V (r) may be estimated from formulas 7.5.h and 7.5.e, substituting 
x for w and r for y. This method of estimation has the advantage of saving 
computation in cases in which the values of r are directly available while those 
of y are not, but it will give a biased estimate of error if the above conditions 
are not fulfilled. An example of the method is given for a stratified sample 
in Example 7.17. 

If V_x(r) is virtually constant, the sampling error of any ratio estimate can
be rapidly calculated once the value of V_x(r) has been established, since only
S(x) and S(x²) require to be known, formula 7.5.e being used. Similarly,
if V_x(r) is inversely proportional to x, i.e. equal to A/x, formula 7.5.f can be
used, only S(x) being required. The effective constancy of A, and its value,
can be most simply established by calculating V (r) in the ordinary manner 
for various batches of data and calculating the resultant values of A from 
formula 7.5.f. This is in general preferable to using formulae 7.5.1 and 
7.5.f directly. 

Example 7.8.a

Estimate the sampling error of the ratio estimate of the acreage of wheat
from the random sample of farms (Example 6.9.a).

We have

S(y²) = 207,261      S(xy) = 902,958      S(x²) = 5,061,734
r = 0.1522430        2r = 0.3044860       r² = 0.0231779

Q = 49,643           s_Q² = 400.35

S.E. (r) = {1/S(x)} √{(1 − 1/20) 125 × 400.35} = 0.01443

S.E. (Y) = 273,074 × 0.01443 = 3,940

Example 7.8.b 

Estimate the sampling error of the estimates of Example 6.9.b.

We have

n = 43        f = 43/325 = 0.1323        g = 7.5581

S(y) = 799        ȳ = 18.5814        x̄ = 79.6977        S(x) = 3,427

S(y²) = 22,065           S(xy) = 76,965                 S(x²) = 328,323

ȳ S(y) = 14,846.5        ȳ S(x) = 63,678.5              x̄ S(x) = 273,124.0

S(y − ȳ)² = 7,218.5      S(x − x̄)(y − ȳ) = 13,286.5     S(x − x̄)² = 55,199.0

r = 0.233149        2r = 0.466298        r² = 0.0543585
Q = 4023.6          s_Q² = 95.80

S.E. (100 r) = (100/3427) √(0.8677 × 43 × 95.80) = 1.744

Had the sample been a random sample of individuals the formula of
Section 7.4 would have been applicable, giving

S.E. (100 r) = 100 √(0.8677 × 0.2331 × 0.7669/3427) = 0.673

The large difference between these two standard errors is an indication of the 
additional variability between kraals, and illustrates the misleading results
that may be obtained by using the formula of Section 7.4 when the sampling
units consist of groups of individuals and not single individuals. 
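The computation of this example can be sketched in a few lines. The sketch assumes the form V(r) = (1 − f) n s_Q² / {S(x)}², which is the form implied by the arithmetic of Examples 7.8.a and 7.8.b. The function name and argument names are illustrative only.

```python
from math import sqrt


def ratio_standard_error(n, f, sum_x, sum_y, sum_xx, sum_xy, sum_yy):
    """Return (r, Q, s_Q^2, S.E.(r)) for a random sample.

    Q is the sum of squared deviations from the ratio line, computed from
    the raw sums of squares and products; the divisor for s_Q^2 is n - 1.
    """
    r = sum_y / sum_x
    q = sum_yy - 2.0 * r * sum_xy + r * r * sum_xx
    s2_q = q / (n - 1)
    se_r = sqrt((1.0 - f) * n * s2_q) / sum_x
    return r, q, s2_q, se_r


# Data of Example 7.8.b (43 sampling units out of 325).
n_units = 43
f = 43 / 325
r, q, s2_q, se_r = ratio_standard_error(
    n=n_units, f=f, sum_x=3427, sum_y=799,
    sum_xx=328323, sum_xy=76965, sum_yy=22065,
)

# For comparison: the Section 7.4 formula, which would apply only if the
# 3427 individuals had themselves been drawn at random.
se_if_individuals = sqrt((1.0 - f) * r * (1.0 - r) / 3427)

print(round(q, 1), round(s2_q, 2))
print(round(100 * se_r, 3), round(100 * se_if_individuals, 3))
```

The two printed standard errors correspond to the two values of S.E. (100 r) compared in the text.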

Since the total number of persons in the reserve is unknown, the standard 
errors of the total numbers are derived from the formulae appropriate to a 
random sample without supplementary information. We therefore have 

S(y − ȳ)²/(n − 1) = 171.87        S(x − x̄)²/(n − 1) = 1,314.3
S.E. (X) = 7.5581 × √(0.8677 × 43 × 1314.3) = 1,673
S.E. (Y) = 7.5581 × √(0.8677 × 43 × 171.87) = 605.3

The standard error of the number present in the reserve can be calculated
in the same manner from the sum of the squares of the deviations of (x − y),
which in turn can be calculated directly from the separate values of (x − y).
In the present case, where the separate values of (x − y) are not tabulated,
and where S(x − x̄)(y − ȳ) has already been calculated, it is more convenient
to obtain the required sum of squares of deviations from the sums of squares
and products already calculated (see Section 7.5). Thus

S{(x − y) − (x̄ − ȳ)}² = 55,199.0 + 7,218.5 − 2 × 13,286.5 = 35,844.5

S.E. (X − Y) = 7.5581 × √{0.8677 × 43 × 35,844.5/42} = 1349

Note that x and y are not independent, being derived from the same
sampling units, and therefore V(X − Y) is not equal to V(X) + V(Y), but
to V(X) + V(Y) − 2 cov(XY). Putting cov(XY) equal to 13,286.5/42 gives
the same result as above.
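The equivalence of the two routes to S.E. (X − Y) can be verified numerically. The sketch below uses the sums of squares and products of Example 7.8.b and takes the total number of sampling units as 325, as implied by f = 43/325. The helper name `se_total` is illustrative only.

```python
from math import sqrt

N_POPULATION_UNITS = 325   # implied by f = 43/325 in the example
n = 43
g = N_POPULATION_UNITS / n  # raising factor, 7.5581
f = n / N_POPULATION_UNITS  # sampling fraction

# Sums of squares and products of deviations from Example 7.8.b.
sxx, syy, sxy = 55199.0, 7218.5, 13286.5


def se_total(sum_sq_dev):
    """S.E. of an estimated total from a sum of squared deviations."""
    return g * sqrt((1.0 - f) * n * sum_sq_dev / (n - 1))


se_x = se_total(sxx)
se_y = se_total(syy)

# Route 1: work directly with the deviations of (x - y).
se_diff = se_total(sxx + syy - 2.0 * sxy)

# Route 2: V(X) + V(Y) - 2 cov(XY), with the covariance of the totals
# built from the covariance per unit, sxy/(n - 1).
cov_totals = g * g * (1.0 - f) * n * sxy / (n - 1)
se_diff_alt = sqrt(se_x ** 2 + se_y ** 2 - 2.0 * cov_totals)

print(round(se_x), round(se_y, 1), round(se_diff), round(se_diff_alt))
```

Ignoring the covariance term would give √(V(X) + V(Y)), which is appreciably larger than either route above because x and y are positively correlated.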

7.9 Ratio method : stratified sample with uniform sampling fraction 

(a) When the ratio is assumed to be the same for all strata : 

Instead of taking the sum of squares of deviations from the general ratio 
line, the deviations from a series of lines parallel to this line and passing through 
the points representing the strata means must be taken, the divisor n − 1
being replaced by n − t.

Thus

\[ Q = \Sigma S_i (y - \bar{y}_i)^2 - 2r\,\Sigma S_i (y - \bar{y}_i)(x - \bar{x}_i) + r^2\,\Sigma S_i (x - \bar{x}_i)^2 \]

\[ s_Q^2 = Q/(n - t) \]

The sums of squares and products will be recognized as the sums of squares 
and products within strata, similar to those already obtained for the y variate 
in the pooled estimate of error for a stratified sample without supplementary
information. 

(b) When the ratio is permitted to assume different values for the different 
strata : 

The common r is replaced by r_i corresponding to the different strata,
so that

\[ Q = \Sigma S_i (y - \bar{y}_i)^2 - 2\,\Sigma r_i S_i (y - \bar{y}_i)(x - \bar{x}_i) + \Sigma r_i^2 S_i (x - \bar{x}_i)^2 \]

The divisor n − t stands.

In this case the contribution to Q from each stratum is best computed 
separately. If desired the variances of the contributions to Y from the different 
strata may also be estimated separately. This course is equivalent to assigning 
slightly different weights to the different contributions to Q, the situation 
being analogous to that already discussed in Section 7.6. 

Note that if the population totals X_i are not known for the different strata
but the total X for the whole population is known, the formulae for case (a) 
must be used for calculating the sampling errors of r and Y, even if the ratio 
clearly varies from stratum to stratum, since the method of estimation must 
be that corresponding to case (a). 
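Cases (a) and (b) differ only in which ratio multiplies the within-strata sums of squares and products. The following sketch makes this explicit. The data in `demo_strata` are invented purely for illustration, and the function names are not from the text. A uniform sampling fraction is assumed, so that the common ratio is the ratio of the sample totals.

```python
def within_sums(xs, ys):
    """Within-stratum sums of squares and products about the stratum means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, sxy, syy


def q_common_ratio(strata):
    """Case (a): one ratio r for all strata, deviations taken from lines
    parallel to the general ratio line through each stratum mean."""
    r = sum(sum(ys) for _, ys in strata) / sum(sum(xs) for xs, _ in strata)
    q = 0.0
    for xs, ys in strata:
        sxx, sxy, syy = within_sums(xs, ys)
        q += syy - 2.0 * r * sxy + r * r * sxx
    return q


def q_separate_ratios(strata):
    """Case (b): a separate ratio r_i for each stratum."""
    q = 0.0
    for xs, ys in strata:
        r_i = sum(ys) / sum(xs)
        sxx, sxy, syy = within_sums(xs, ys)
        q += syy - 2.0 * r_i * sxy + r_i * r_i * sxx
    return q


# Hypothetical data: three strata, each a pair (x values, y values).
demo_strata = [
    ([10, 20, 30], [2, 5, 5]),
    ([40, 60], [9, 14]),
    ([5, 15, 25, 35], [1, 2, 6, 7]),
]

n_total = sum(len(xs) for xs, _ in demo_strata)
t = len(demo_strata)

# In both cases the divisor is n - t.
s2_common = q_common_ratio(demo_strata) / (n_total - t)
s2_separate = q_separate_ratios(demo_strata) / (n_total - t)
print(s2_common, s2_separate)
```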

Example 7.9 

Estimate the sampling errors of the estimate of Example 6.10. 

The contributions to Q from the six districts are : 

District        Q_i              District        Q_i

1            5,107.59            4            20,566.56
2            1,550.71            5 & 7         3,737.14
3            7,963.98            6             1,080.92

                                 TOTAL        40,006.90
Hence 



7.10 Ratio method : stratified sample with variable sampling fraction 

If the ratio is assigned different values in the different strata the variance 
of the estimated total is



\[ V(Y) = \Sigma \{ g_i^2 (1 - f_i)\, n_i\, s_{Qi}^2 \} \quad \text{approximately,} \]
\[ V(r) = V(Y)/X^2 \]
Here the s_Qi² are estimated separately for each stratum, using the value of the
ratio appropriate to the stratum and the divisor n_i − 1 for Q_i.

If the ratio is assumed to be the same for all strata the same formula may
be used, with the exception that the Q_i are calculated using the general ratio,
with divisors n_i − 1 as before.

For an illustration of the application of these formulae see Example 7.17. 

7.11 Ratio method : integral values of the supplementary variate 

When the supplementary variate x can only assume small integral values 
the above formulae for Q can be simplified by classifying the data according 
to the value of x. The most common instance in censuses and surveys is in 
surveys of human populations in which the sampling units are households and
information is required on individuals. 

In the analysis of data appertaining to individuals the working unit in 
the analysis will commonly be the individual, although the sampling unit is 
the household. Clear distinction must therefore be made between the values y 
for the sampling units which in households of two, for instance, will consist 
of the totals of pairs of individuals, and the values for the individuals. These 
latter values we will denote by z, with the convention that [z] for families
of more than one unit represents the total of the individuals in this family,
so that [z] equals y. With this notation r = z̄. Suffixes will be used to indicate
size of family ; n₁, n₂, ... to denote the numbers of families of the different
sizes.

No difficulty should be found in transforming the formulas for Q into 
a form suitable for computation. In the case of a random sample, for instance, 
we find 



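\[ Q = S_1(z^2) + S_2([z]^2) + \ldots - 2\bar{z}\,\{S_1(z) + 2S_2([z]) + \ldots\} + \bar{z}^2 (n_1 + 4n_2 + \ldots) \]

\[ \;\; = S_1(z - \bar{z}_1)^2 + S_2([z] - 2\bar{z}_2)^2 + \ldots + n_1(\bar{z}_1 - \bar{z})^2 + 4n_2(\bar{z}_2 - \bar{z})^2 + \ldots \]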

It will be noted that in order to calculate Q the quantities [z] are required. 
In the event of the survey material being recorded on punched cards, each 
individual will normally be assigned to a separate card. The required totals 
can then be obtained on the tabulator by sorting for family designation, family 
size, and any stratification which is required, and controlling on family 
designation, either printing the totals or reproducing them directly on to new 
family cards. The latter procedure will be advantageous if further analysis 
is required on the characteristics of families regarded as entities. 

This type of analysis can be confined to a special type of individual, such 
as adult males. If punched cards are used a card count will have to be 
introduced to count the number of the special type occurring in each family, 
which may be termed the "partial size" of the family. There is then the
minor complication that families of different partial sizes will occur together
in the tabulated results. This is overcome if the results are reproduced on
cards, as the "partial sizes" can be punched on the cards, and the cards
subsequently sorted by partial size and listed.

The last of the above forms for Q separates the various components of 
variance. The first term gives the contribution to Q arising from variation 
between families of one, the second that between families of two, etc., and the 
first term of the second set gives the contribution due to the average deviation 
of families of one from the general mean, etc. If the sample were stratified 
for size of family the second set of terms would be omitted. They will also 
be omitted if the error of a mean standardized for distribution of family size 
is required. 
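The separation of Q into within-size and between-size components can be checked on a small invented data set. In the sketch below `demo_families` maps each family size to a list of family totals [z]. These figures are hypothetical and the function names are illustrative only. Here z̄ is the mean per individual over the whole sample and z̄_k the mean per individual in families of size k.

```python
def overall_mean_per_individual(families):
    """z-bar: total of z over all individuals divided by number of individuals."""
    total_z = sum(sum(totals) for totals in families.values())
    total_individuals = sum(size * len(totals) for size, totals in families.items())
    return total_z / total_individuals


def q_direct(families):
    """Q as the sum of squared deviations of [z] from the ratio line size * z-bar."""
    z_bar = overall_mean_per_individual(families)
    return sum((total - size * z_bar) ** 2
               for size, totals in families.items()
               for total in totals)


def q_components(families):
    """Return (within-size part, between-size part) of Q."""
    z_bar = overall_mean_per_individual(families)
    within = 0.0
    between = 0.0
    for size, totals in families.items():
        n_k = len(totals)
        z_bar_k = sum(totals) / (size * n_k)   # mean per individual, size k
        within += sum((total - size * z_bar_k) ** 2 for total in totals)
        between += size * size * n_k * (z_bar_k - z_bar) ** 2
    return within, between


# Hypothetical family totals [z], keyed by family size.
demo_families = {
    1: [3.0, 5.0, 4.0, 6.0],
    2: [9.0, 7.0, 10.0],
    3: [12.0, 15.0],
}

within_part, between_part = q_components(demo_families)
print(q_direct(demo_families), within_part + between_part)
```

If the sample were stratified by size of family only `within_part` would be retained, in line with the remark that the second set of terms is then omitted.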

The above formula for Q can be set out in analysis of variance form, as
in Table 7.11.

TABLE 7.11  ANALYSIS OF VARIANCE FORM FOR Q FOR INTEGRAL VALUES OF x

                                     Degrees
                                     of freedom              Sum of squares

Between families of size 1           n₁ − 1                  S₁(z − z̄₁)²
Between families of size 2           n₂ − 1                  S₂([z] − 2z̄₂)²
Between families of size 3           n₃ − 1                  S₃([z] − 3z̄₃)²
   ...
Between means of families of
  different sizes                    (number of sizes) − 1   n₁(z̄₁ − z̄)² + 4n₂(z̄₂ − z̄)² + ...

The mean square of the first line then gives the estimate s₁² of the error
variance of families of size 1, the mean square of the second line, divided
by 4, the estimate of the error variance of family means of size 2, etc. The
means of families of a given size can thus be compared for different parts of
the population, remembering that the further divisors for z̄₂, etc., depend
on the numbers of families and not individuals entering into the means.

7.12 Regression method : random sample 

The estimation of error in the regression method follows much the same 
lines as in the ratio method. The sum of squares of deviations from the ratio 
line is replaced by the sum of squares of deviations from the regression line, 
and the divisor n − 2 is used instead of n − 1, since an additional degree
of freedom is accounted for by the fact that the regression line not only passes 
through the mean point, but has its slope determined independently from the 
data. 
The sum of the squares of the deviations from the regression line is given
by the equation

\[ Q = S(y - \bar{y})^2 - b\,S(x - \bar{x})(y - \bar{y}) = S(y - \bar{y})^2 - \frac{\{S(x - \bar{x})(y - \bar{y})\}^2}{S(x - \bar{x})^2} \]

so that

\[ s_Q^2 = Q/(n - 2) \]

and, if errors in b are neglected 



The error variance of b, if the variance of y for fixed x is constant, is



so that if the regression is truly linear the error variance of a standardized 
value y of y for the value x of x is 



The correction for finite population is here omitted since standardized values 
are ordinarily used for comparative purposes. 

Allowance for errors in b can be made in V (y) in the same way, but such 
errors will always be small relative to the other component of error. They 
will on the average increase the error variance approximately in the ratio 
/(n - 1). 

In the case of two-phase sampling the sampling variance of x̄₁ will introduce
the additional term b² V(x̄₁) into the above expression for V (y). The
general approach is given in Section 8.7. 

If an arbitrary value b₀ of the regression coefficient is used, instead of the
value b calculated from the data, the formula for the sum of the squares of the
deviations from the arbitrary regression line becomes

\[ Q = S(y - \bar{y})^2 - 2b_0\,S(x - \bar{x})(y - \bar{y}) + b_0^2\,S(x - \bar{x})^2 \]
and

\[ s_Q^2 = Q/(n - 1) \]

This procedure is equivalent to the analysis of the values y − b₀x from
the individual sampling units. Use of the above expression for Q saves the
trouble of calculating the values of y − b₀x for the individual sampling units,
at the expense of calculating the sums of squares and products of x and y
instead of the sums of squares of y − b₀x.

No allowance has to be made for errors in b₀, but an arbitrary value b₀
should not be used for standardization unless it is known that b₀ approximates
closely to b, so that the error introduced into the standardization correction,
namely (b₀ − b) multiplied by the deviation of the sample mean of x from the
standard value, is small.
Example 7.12.a

Estimate the errors of the estimates of Example 6.12.a.

We have

Q = 164,904 − 0.19316 × 624,739 = 44,229
s_Q² = Q/123 = 359.59

V (y) = (1 − 1/20) 359.59/125 = 2.733 = (1.653)²
S.E. (Y) = 2496 × 1.653 = 4126

V (b) = 359.59/S(x − x̄)² = 0.0001112 = (0.01054)²

Example 7.12.b 

Estimate the error of the estimates of total volume of timber of Example 
6.12.b. 

Except for the estimate derived from the arbitrary value of the regression
coefficient b = 1 the computations follow the lines already given and are
left as an exercise for the reader.

When b = 1 we have

Q = 115,266 − 2 × 52,069 + 82,296 = 93,424
s_Q² = 93,424/24 = 3893

The values of the error variance per unit, and the resultant standard errors 
of the various estimates, are as follows : 

                            Variance                               Relative
                            per unit     S.E. (total volume)       efficiency

Sample plots only            4,803       710,000 cu. ft.               72
Ratio method                 4,230       602,000 cu. ft.               82
Regression, b = 1            3,893       639,000 cu. ft.               89
Regression, b = 0.6327       3,579       613,000 cu. ft.               96
Regression, b = 0.55         3,454       603,000 cu. ft.              100

The relative efficiency of the various methods of estimation is inversely 
proportional to the value of the variance per unit. Setting the last value at 
100 we obtain the relative efficiencies of the last column. The relative 
efficiencies fall in the order given. If the information from the sample plots 
only is used, neglecting the eye estimates, about 40 per cent more sample
plots will be required to give results of the same accuracy as those obtained
by using a regression of 0.55 on the eye estimates.

The values for b = 0.55 have been included to illustrate the fact that
any value of the regression coefficient near to the value derived from the data 
will give results which are of about the same accuracy. Here there is an
apparent small gain in accuracy owing to change in the degrees of freedom 
from 23 to 24. There is therefore no point in attempting to take account of 
small differences in the regression coefficient for different parts of the data, 
or to determine b very exactly. Any value reasonably near the correct value 
will give a satisfactory adjustment. 
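The three regression entries in the table of variances per unit can be reproduced from the sums of squares and products quoted for b = 1, namely S(y − ȳ)² = 115,266, S(x − x̄)(y − ȳ) = 52,069 and S(x − x̄)² = 82,296, with n = 25. The sketch below is a minimal illustration. The function name is invented, and the two non-regression variances are simply copied from the table.

```python
def variance_per_unit(syy, sxy, sxx, n, b0=None):
    """Error variance per unit about a regression line.

    With b0=None the coefficient is estimated from the data and the
    divisor is n - 2; with an arbitrary b0 the divisor is n - 1.
    """
    if b0 is None:
        b = sxy / sxx
        q = syy - b * sxy
        return q / (n - 2)
    q = syy - 2.0 * b0 * sxy + b0 * b0 * sxx
    return q / (n - 1)


SYY, SXY, SXX, N = 115266.0, 52069.0, 82296.0, 25

variances = {
    "sample plots only": 4803.0,   # copied from the table
    "ratio method": 4230.0,        # copied from the table
    "regression, b = 1": variance_per_unit(SYY, SXY, SXX, N, b0=1.0),
    "regression, b from data": variance_per_unit(SYY, SXY, SXX, N),
    "regression, b = 0.55": variance_per_unit(SYY, SXY, SXX, N, b0=0.55),
}

# Relative efficiency: inversely proportional to the variance per unit,
# with the smallest variance set at 100.
smallest = min(variances.values())
for label, value in variances.items():
    print(f"{label:26s} {value:8.0f} {100.0 * smallest / value:6.0f}")
```

The slightly smaller variance for b = 0.55 than for the fitted coefficient arises solely from the change of divisor from 23 to 24, as the text remarks.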

The ratio method has given an estimate of the standard error which is 
relatively low because S (x) for the sample is high. The average performance 
of the ratio method may best be judged by the variance per unit. 

Limits of error can be assigned to the possible bias in the eye estimates 
by calculating the standard error of the difference x̄ − ȳ, or of the ratio r.
We have

S.E. (x̄ − ȳ) = √(3893/25) = 12.5.

The actual difference, −15.3, is 1.22 times its standard error, and these
data therefore do not by themselves furnish conclusive evidence of the
existence of bias in the eye estimates. Taking limits of error of twice the
standard error gives limits to the bias of −40.3 and +9.7, i.e. −27 per
cent and +7 per cent.

Similarly S.E. (r) = 0.098, and since r = 1.116 the ratio of the deviation
of r from unity to its standard error is 1.19, which compares with the value
of 1.22 for x̄ − ȳ.
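The limits of error quoted for the bias follow directly from the variance per unit for b = 1. A minimal sketch is given below. It takes the difference as −15.3, an assumption about sign made so that the limits agree with those quoted, and it uses limits of twice the standard error as in the text.

```python
from math import sqrt

VARIANCE_PER_UNIT = 3893.0   # regression with b = 1, i.e. variance of (y - x)
N_PLOTS = 25
OBSERVED_DIFFERENCE = -15.3  # sign assumed, to agree with the quoted limits

se_difference = sqrt(VARIANCE_PER_UNIT / N_PLOTS)

# Limits of error taken as twice the standard error on either side.
lower_limit = OBSERVED_DIFFERENCE - 2.0 * se_difference
upper_limit = OBSERVED_DIFFERENCE + 2.0 * se_difference

print(round(se_difference, 1), round(lower_limit, 1), round(upper_limit, 1))
```

Since the interval includes zero, the sample by itself does not establish the existence of bias, which is the conclusion drawn in the text.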

As mentioned in Example 6.12.b, the more extensive data of the full 
survey confirmed the existence of bias, giving at the same time a more accurate 
determination of its average magnitude and variation for different types of 
woodland. 

7.13 Regression method : stratified and balanced samples 

(a) Uniform sampling fraction, regression coefficient the same for all
strata :

As for a random sample, except that

\[ Q = \Sigma S_i (y - \bar{y}_i)^2 - b\,\Sigma S_i (y - \bar{y}_i)(x - \bar{x}_i) \]



If an arbitrary value b₀ is taken, the formula for Q must be rewritten in
the same manner as in an unstratified sample, the divisor being n − t.

(b) Uniform sampling fraction, regression coefficients different for the
different strata :

\[ b_i = S_i (y - \bar{y}_i)(x - \bar{x}_i) / S_i (x - \bar{x}_i)^2 \]

If a pooled estimate of error is used,

\[ Q = \Sigma S_i (y - \bar{y}_i)^2 - \Sigma b_i S_i (y - \bar{y}_i)(x - \bar{x}_i) \]

If the variances from the regression lines in the different strata are likely
to be different, separate estimates of σ² should be used when evaluating V (b_i).

(c) Variable sampling fraction : 

The method is similar to that outlined for the ratio method. 

(d) Balanced sample : 

As far as the estimation of error is concerned, a balanced sample must 
be treated as if it were an unbalanced sample from which estimates have been 
derived by the use of regression on the balanced variate. There will be no 
regression adjustment to the estimates of the population values, since the
difference between the population and sample means of x will be zero because
of the balancing.

7.14 Calibration of eye estimates 

When a regression is used to calibrate eye estimates, as described in
Section 6.15, the sampling variance of y can be split into three parts, that due
to errors in b′, that due to the sampling variance of x̄₁ arising from the main
sampling process, and that due to the variance of x̄₁ − x̄ arising from the
variance about the regression line.

The component of variance due to errors in b′ is usually sufficiently small
to be neglected. To a first approximation it equals



where V (b′) is calculated in the ordinary manner from the regression.

The variance of x̄₁ is calculated from the values of x for all the selected
units in the manner appropriate to the method of sampling adopted. The
contribution to V (y) from this source is approximately

\[ V(\bar{x}_1)/b'^2 \]

A closer approximation is obtained by multiplying this variance by
{V (x) − V′ (x)}/V (x), where V (x) is that part of the variance of x which
contributes to V (x̄₁) and V′ (x) is the residual variance of x about the regression
line.

The variance of x̄₁ − x̄ due to variance about the regression line is
calculated from the residual variance V′ (x) of x about this line. If n₁ and
n represent the numbers of units in the original sample and the sub-sample
for eye estimates, the contribution to V (y) when all units are given equal
weight in the mean is

\[ (n_1 - n)\,V'(x) / (b'^2\, n_1 n) \]

If the x's are weighted according to area a or other weights, the last expression
becomes



where S₁, S and S′ indicate summation over the whole sample, over the
sub-sample, and over the part of the sample not included in the sub-sample,
respectively.

Example 7 .14 

Estimate the variance of the mean yield per acre obtained in Example 
6.15. 

The component of variance due to errors in I' is negligible, since ^ x 
is nearly zero. 

Since # x was calculated by weighting the weighted mean eye estimates 
of the individual farms by the wheat acreages of these farms, the ratio method 
of calculating the sampling error at the first stage is applicable. A table was 
therefore prepared giving for each farm, (1) the total wheat acreage, (2) the 
weighted mean yield per acre based on the eye estimates of all the chosen 
fields on that farm, and (3) the product of these two numbers. Columns (1) 
and (3) constitute, in the ordinary ratio notation, the x and y values. Using 
these tabulated values, and the ordinary formula for the variance of a ratio, 
with the inclusion of the factor (I/), we find V (x^ = 0-8200,* and 
consequently the corresponding component of variance is -8200/0 -6926 2 or 
1-7095. The factor (1 /) can properly be included here since for the 
majority of farms all the fields were taken. On the other hand, although 
the variance per field of the eye estimates is probably reasonably constant, the 
alternative approach outlined in Section 7.5, making use of this fact, would 
present difficulties, since the sampling units at the first stage are farms and 
not fields. The direct approach is therefore simpler. 

The residual variance about the regression line was found to be, from an 
analysis of the unweighted data for fields, V ; (x) = 7-038. The sums and 
sums of squares of the areas of the individual fields for which actual yields 
are and are not available, and of all fields, are 

S (a) = 610 S' (a) = 1279 5 X (a) = 1889 
S (a 2 ) = 15,172 S' (a*) = 33,899 S x (a*) = 49,071 

Substitution in the formula above gives a component of variance of 0*4137. 
Fields and not farms can reasonably be used here, since errors in the eye 
estimates may be expected to be reasonably independent from field to field. 
This is not the case with V (x^, since the yields of fields on the same farm 
often show considerable correlation. 

The standard error of the adjusted mean yield is therefore 
√(1.7095 + 0.4137) = 1.46 bushels per acre. The main source of error 
is that due to sampling errors introduced by the variation in yields from farm 
to farm. The eye estimates are shown to be sufficiently consistent and to 
give adequate differentiation between differing yields. Regarded as an estimate 
of the mean yields of the fields actually sampled, the adjusted mean yield has 
a standard error of only √0.4137 = 0.64 bushels per acre. 

* The closer approximation gives the value 0.733. 

7.15 Sampling with probabilities proportional to size of unit 

In this case unbiased estimates of the sampling variances are those based 
on the mean square deviation of r. When units selected more than once are 
included the number of times they are selected, no correction for finite 
population is required. If s_r^2 is the estimated variance of r we then have, for 
a random sample of n units, 

V(r̄) = s_r^2/n 

If the size of the population is known we therefore have 

V(Y) = X^2 s_r^2/n 

If the size of the population is not known, the sampling variance of the 
estimate of total size X is derived from the formulae for a qualitative variate 
(Section 7.4). We have, for a random sample, following the notation of 
Section 6.16, 



Hence, if A is known exactly, 

n f n\ I X 2 n 
= - - 
n ** 



If A is not known exactly its estimation will contribute some slight 
additional variance, the amount of which depends on the precise method of 
location of the points. This, however, will in general be sufficiently small 
to be neglected. Substituting A = n^d for A, we have 



Example 7.15 

Estimate the sampling errors of the estimates of Examples 6.16.a and 
6.16.b, given that the standard deviation per field of the yield per acre is 
3.5 cwt. per acre, and that the distribution of points can be taken as random. 

The standard error of the mean yield per acre is 

S.E.(r̄) = 3.5/√529 = 0.152 cwt. 

For the estimates of Example 6.16.a, 

V(X) = 2560^2 × ... acres^2 
S.E.(X) = 57,000 acres 

S.E.(Y) = 45,900 tons 

For the estimates of Example 6.16.b, 

V(X) = 640^2 × 2202 × 31,053/33,255 = 842.2 × 10^6 acres^2 
S.E.(X) = 29,000 acres 

V(Y) = 1,409,000^2 × 3.5^2/529 + 15.7^2 × 842.2 × 10^6 = 2536 × 10^8 cwt.^2 
S.E.(Y) = 25,200 tons 

Note that if the total area of crop were known accurately from other 
sources we should have 



S.E. (Y) = 10,300 tons 

If the acreage is not known a survey of this type will clearly be more efficient 
if sample harvesting is carried out at a proportion only of the sample points, 
the presence or absence of the crop being determined at the remaining points. 
This point is discussed in more detail in Section 8.17. 
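
The arithmetic of Example 7.15 for the estimates of Example 6.16.b can be checked with a short Python sketch. It assumes that the number of sample points falling in the crop is binomially distributed, and that the total number of points is 33,255, a figure inferred from the arithmetic of the example rather than read directly. The function names are illustrative only.

```python
import math


def area_variance_from_points(area_per_point, points_in_crop, points_total):
    """Variance of an area estimated as area_per_point * points_in_crop,
    treating the count of points falling in the crop as binomial."""
    p = points_in_crop / points_total
    return area_per_point ** 2 * points_in_crop * (1.0 - p)


def total_yield_variance(area, sd_ratio, n_harvested, mean_ratio, area_variance=0.0):
    """V(Y) for Y = area * mean_ratio. The first term is the error of the mean
    yield per acre, the second the error of the area (zero if the area is known)."""
    return area ** 2 * sd_ratio ** 2 / n_harvested + mean_ratio ** 2 * area_variance


# Figures quoted for Example 6.16.b; 33,255 total points is an inferred value.
v_x = area_variance_from_points(640, 2202, 33255)
v_y = total_yield_variance(1_409_000, 3.5, 529, 15.7, v_x)
se_x_acres = math.sqrt(v_x)
se_y_tons = math.sqrt(v_y) / 20  # 20 cwt. to the ton

print(round(se_x_acres), round(se_y_tons))
```

The second term of V(Y) is much the larger, which is why the error of the estimated area dominates when the acreage is not known.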

If the sample is such that it can be regarded as stratified by districts it 
might at first sight appear that the between-districts component should be 
eliminated from the variance of r. Unless the crop areas of the different 
districts are accurately known, however, this must not be done, since the 
proportion of sampling points falling in the crop in a district will not be 
accurately proportional to the area of the crop in that district. 

If the sample points are confined to some localities only by a two-stage 
sampling process, with localities as first-stage sampling units, the above variances 
will represent the second-stage components of variance only. The full 
sampling errors must be determined from the first-stage units as explained 
in Section 7.17. 

7.16 Sampling from within strata with probabilities proportional to 
size of unit 

Since the number of units within each stratum is in general small, a pooled 
estimate s_r^2 of the within-strata variance of r based on the mean square 
deviations of r within strata will be required. Consequently 

^ 2==: n~t 

We then have 

and thus 



V(r) = V(Y)/X^2 

If the variation in the within-strata variances is large it may be necessary 
to introduce weights when forming the pooled estimate of error, in order to 
avoid bias. 

The correction for finite sampling is not required if the units selected 
more than once are included the number of times they are selected. In the 
more usual case in which each unit is included once only, additional units 
being selected by the method of Section 3.10, the correction for finite sampling 
should be included. In this case, since probability of selection is not strictly 
proportional to size, the formulae are approximate only. 

Example 7.16 

Estimate the sampling errors of the estimate of Example 6.17. 

The analysis of variance of the values of the ratio for the individual 
"combined" parishes is given in Table 7.16. It follows the lines of 
Example 7.7. 



TABLE 7.16 ANALYSIS OF VARIANCE OF THE VALUES OF THE RATIO 

                       Degrees of    Sum of      Mean 
                       freedom       squares     square 

Between districts          6         0.04952     0.008253 
Within districts          10         0.02649     0.002649 

TOTAL                     16         0.07601     0.004751 

This gives a value of s_r^2 of 0.002649. We then have 
S{X_i^2 (1 - f_i)/n_i} = 22,932^2 × ... + 43,591^2 × .../3 + ... = 37.188 × 10^8 
S.E.(Y) = √(0.002649 × 37.188 × 10^8) = 3140 

It will be noted that the variance of r within districts is less than the 
overall variance. Consequently stratification by districts appreciably reduces 
the sampling error. 
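
The figures of Example 7.16 may be verified with the following Python sketch. It takes the sums of squares of the analysis of variance and the quoted value of the weighted sum, 37.188 × 10^8, as given; the function name is illustrative and not part of the original text.

```python
import math


def mean_squares(between_ss, between_df, within_ss, within_df):
    """Mean squares for a one-way analysis of variance, given the sums of
    squares and degrees of freedom between and within strata."""
    return {
        "between": between_ss / between_df,
        "within": within_ss / within_df,
        "total": (between_ss + within_ss) / (between_df + within_df),
    }


ms = mean_squares(0.04952, 6, 0.02649, 10)
s_r2 = ms["within"]        # pooled within-districts estimate of the variance of r
weight_sum = 37.188e8      # weighted sum over districts, as quoted in the example
se_y = math.sqrt(s_r2 * weight_sum)

print(ms, round(se_y))
```

Using the total mean square in place of the within-districts mean square would give a larger standard error, which is the gain from stratification noted in the example.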

7.17 Multi-stage sampling 

If the sampling fraction at the first stage is small, the total sampling error 
of multi-stage sampling is obtained from the first-stage unit values, estimating 
each unit value from the results of the sampling at the second and following 
stages, and using the method of estimation appropriate to the method of sampling 
at the first stage. The additional variability contributed by the second and 
following stages is automatically included in this estimate of error. 

If, on the other hand, the sampling fraction f' at the first stage is not 
sufficiently small for the factor (1 - f') to be neglected, the sampling error 
will be increased on account of the fact that the selected first-stage unit values 
are themselves subject to sampling error, instead of being known exactly, as 
in single-stage sampling. The increase in the sampling variance of whatever 
estimate is under consideration will be equal to f' times the variance in this 
estimate resulting from the sampling at the second and following stages. This 
variance can be calculated by regarding the selected first-stage units as strata 
which are sampled by the sampling at the second and following stages. 

Thus, for example, in two-stage random sampling, with n' selected first-stage 
units, and n" second-stage units selected from each first-stage unit, if s'^2 is 
the estimate of the sampling variance of the first-stage unit means, i.e. the 
means of the second-stage values for the separate selected first-stage units, 
s"^2 is the variance of the second-stage units about the first-stage unit means, 
and f' and f" are the first- and second-stage sampling fractions, the sampling 
variance of the mean of the selected first-stage units will be (1 - f") s"^2/n'n", 
since this mean is based on n'n" second-stage units, and therefore the sampling 
variance of the mean of the population is 

V(ȳ) = (1 - f') s'^2/n' + f' (1 - f") s"^2/n'n" 
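
As an illustration, the two-stage formula for the variance of the population mean, a first-stage term (1 - f') s'^2/n' plus f' times the second-stage term (1 - f") s"^2/n'n", can be written as a small Python function. The numerical values in the usage lines are hypothetical and serve only to show the behaviour of the two terms.

```python
def two_stage_mean_variance(s1_sq, s2_sq, n1, n2, f1=0.0, f2=0.0):
    """Sampling variance of the population mean in two-stage random sampling.

    s1_sq : variance of the first-stage unit means (s'^2)
    s2_sq : variance of second-stage units about their unit means (s"^2)
    n1, n2: numbers of first-stage units and of second-stage units per unit
    f1, f2: first- and second-stage sampling fractions
    """
    first_stage = (1.0 - f1) * s1_sq / n1
    second_stage = f1 * (1.0 - f2) * s2_sq / (n1 * n2)
    return first_stage + second_stage


# Hypothetical figures: 10 first-stage units, 5 second-stage units in each.
v_small_f = two_stage_mean_variance(4.0, 9.0, 10, 5)            # f' negligible
v_large_f = two_stage_mean_variance(4.0, 9.0, 10, 5, 0.2, 0.1)  # f' = 0.2
print(v_small_f, v_large_f)
```

When f' is negligible the second term vanishes and the variance reduces to s'^2/n', as stated for the case of a small first-stage sampling fraction.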



Example 7.17 

Calculate the sampling errors of the estimates of the mean dressing of 
nitrogen per acre obtained in Example 6.19. 

The y's and x's entering into the ratio method estimate at the first stage 
are the successive terms of the expressions for S(g"y) and S(g"x), already 
given for small farms, namely 1.36, 3.78, 3.30, . . . and 2, 6, 6, . . . 
respectively. 

The calculation of the sampling error follows the lines indicated in 
Section 7.10, the ratio (cwt. nitrogen per acre) being assumed, for the reason 
given at the end of Section 7.9, to be the same for all strata. 
The values of s_qi^2 are found to be 

Small farms  : 27.519/21 = 1.3104 
Medium farms : 583.81/35 = 16.680 
Large farms  : 2779.1/8 = 347.39 

We then have, neglecting the factors (1 - f_i), 

V(r) = (105^2 × 22 × 1.3104 + 59^2 × 36 × 16.680 + 30^2 × 9 × 347.39)/58,229^2 
= (0.0392)^2 

The sampling fractions are here all small and the variance at the second 
stage therefore need not be considered. With the present material this variance 
could not in any case be estimated since only one field per farm was selected. 
In such cases, when the f_i are not small, it is still best to neglect them. The 
sampling error will then be slightly overestimated. 

Farms without sugar beet have been excluded from the above calculation. 
Their inclusion, though substantially decreasing the values of s_qi^2, would 
make little difference to the final estimate of error, since the values of n_i in 
the formula for V(Y) would be correspondingly increased. 

From inspection of the data, and from the nature of the material, we may 
expect that the variance of the mean dressing per acre r will be substantially 
constant, irrespective of the size of field to which it is applied. The alternative 
procedure of calculating the variance of r directly without any weighting may 
therefore be followed without serious risk of introducing any marked bias 
into the estimate of error. 

TABLE 7.17 ANALYSIS OF VARIANCE OF DRESSINGS OF NITROGEN PER ACRE 

                   Degrees of    Sum of     Mean 
                   freedom       squares    square 

Small farms            21        1.1304     0.0538 
Medium farms           35        1.3500     0.0386 
Large farms             8        0.9836     0.1230 

TOTAL                  64        3.4640     0.0541 

The within-size-groups sums of squares and mean squares of r are shown 
in Table 7.17. There is no marked difference between the different size-groups, 
and in the following calculations we will therefore use the pooled estimate of 
the mean square, s_r^2 = 0.0541. 

From the formulas of Section 7.5 we have, for a single stratum, 

and 

where the x's are those entering into the first-stage sampling, i.e. g"x in the 
full notation. The sums of squares of g"x will be found to be 580, 15,245 
and 39,993 for small, medium and large farms respectively. Consequently, 
summing over all strata as before, omitting the (1 - f_i), and taking the variable 
sampling fraction into account, we have 

V(r) = 0.0541 (105^2 × 580 + 59^2 × 15,245 + 30^2 × 39,993)/58,229^2 
= (0.0391)^2 

This is almost identical with the value previously obtained. 

The comparison between the two methods of calculating the variance may 
be taken a stage further by estimating the values of s_ri^2 from those of s_qi^2 and 
comparing them with those obtained directly. Equating the two expressions 
for V(Y_i), we find s_ri^2 = n_i s_qi^2/S_i(g"x)^2. Using the values of s_qi^2 already given 
we obtain for s_ri^2 the values 0.0497, 0.0394, 0.0782 respectively. These show 
no consistent divergence from the values of Table 7.17, and we may therefore 
conclude that the bias in the estimation of error by the second method is likely 
to be small. A more thorough investigation could be made by tabulating a 
number of comparisons of the above type from various batches of similar 
data. 
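
The two calculations of V(r) in Example 7.17, and the implied values of the variance of r in each size-group, can be reproduced with the Python sketch below. The variable names are illustrative; the stratum multipliers 105, 59 and 30, the numbers of farms 22, 36 and 9, and the divisor 58,229 are taken as they appear in the arithmetic of the example.

```python
import math

multipliers = [105, 59, 30]          # stratum multipliers used in the example
n_farms = [22, 36, 9]                # farms with sugar beet in each size-group
s_q2 = [1.3104, 16.680, 347.39]      # mean squares from the ratio method
sum_sq_x = [580, 15245, 39993]       # sums of squares of g"x
total_x = 58229
s_r2_pooled = 0.0541                 # pooled mean square of r (Table 7.17)

# Ratio method: sum of g^2 * n * s_q^2 over strata, divided by X^2.
v_ratio = sum(g * g * n * q for g, n, q in zip(multipliers, n_farms, s_q2)) / total_x ** 2

# Direct method: pooled variance of r times sum of g^2 * S(x^2), divided by X^2.
v_direct = s_r2_pooled * sum(g * g * sx for g, sx in zip(multipliers, sum_sq_x)) / total_x ** 2

# Variance of r implied by the ratio method in each size-group.
implied_s_r2 = [n * q / sx for n, q, sx in zip(n_farms, s_q2, sum_sq_x)]

print(round(math.sqrt(v_ratio), 4), round(math.sqrt(v_direct), 4))
print([round(v, 4) for v in implied_s_r2])
```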

The value of s_r^2 given by Table 7.17 is directly appropriate for the calculation 
of the variance of the estimate from the unweighted means (Table 6.19.c). 

The sampling standard error of the mean dressing over all fields is 
√(0.0541/67) = 0.0284. This standard error does not include any errors 
due to bias, but will be appropriate, or at least approximately so, to comparisons 
of such a nature that the major part of the bias is eliminated. If there were 
large differences between size-groups the question of whether the pooled 
within-size-groups variance or the overall variance is appropriate to the 
comparison in question would have to be considered; this, however, involves 
other problems, such as how far the differences observed are due to differences 
in size-group proportions (see Examples 7.5 and 7.7.b). 

It will be noted that the standard error of the properly weighted ratio 
estimate is considerably greater than the standard error of the straight mean. 
The ratio of the squares is 0.0392^2/0.0284^2 = 1.91. Thus about double the 
number of farms, excluding those without sugar beet, are required to attain 
the same accuracy when unbiased estimates are required. This is inevitable 
in a survey of this kind where the sampling fractions cannot be adjusted so as 
to be proportional to the areas of the crop being sampled, a course which is 
impossible when a number of crops are covered in the same survey, even if 
the necessary information is available. 

7.18 Systematic samples 

No fully valid estimate of the sampling error of a systematic sample is 
possible, since the units are not located at random within defined strata. 
Approximate estimates can be made in various ways. The simplest, which 
will suffice for most census and survey work, is to divide the material arbitrarily 
into strata, and calculate the sampling error as if the units were selected at 
random from these strata. 

In the case of a systematic sample from a list it will usually be sufficient to 
take account of the major groupings of the list, treating these as strata, and 
to ignore any minor and ill-defined groupings. An example of this has already 
been given (Example 7.7.a). 

In the case of one-dimensional systematic sampling, e.g. equally spaced 
points on a line, or equally spaced lines covering an area, the strata may be 
taken to contain pairs of successive units, so that the error variance is estimated 
from the differences between the members of the pairs. Each difference 
contributes one degree of freedom. If there are n' such differences d, the error 
variance per unit is therefore 

s^2 = S(d^2)/2n' 

Since the pairing is arbitrary, instead of taking alternate differences between 
successive units all differences may be taken. This is equivalent to taking two 
sets of overlapping strata. The accuracy of the estimate of s 2 is thereby somewhat 
increased, though it is not doubled, owing to lack of independence between 
the successive differences. 
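
A minimal Python sketch of the estimate from successive differences follows. It assumes the error variance per unit is the sum of squares of the differences divided by twice their number, and offers the choice between alternate (non-overlapping) pairs and all successive differences; the function name and the data are hypothetical.

```python
def successive_difference_variance(values, overlapping=False):
    """Error variance per unit from differences between successive units.

    With overlapping=False the units are paired off (1,2), (3,4), ...;
    with overlapping=True every successive difference is used.
    """
    step = 1 if overlapping else 2
    diffs = [values[i + 1] - values[i] for i in range(0, len(values) - 1, step)]
    if not diffs:
        raise ValueError("at least two values are needed")
    return sum(d * d for d in diffs) / (2 * len(diffs))


sample = [1, 3, 2, 6]  # hypothetical observations in their systematic order
print(successive_difference_variance(sample))                    # pairs (1,3), (2,6)
print(successive_difference_variance(sample, overlapping=True))  # all three differences
```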

In the case of two-dimensional sampling on a square or rectangular pattern 
the strata should consist of sets of four units in a 2 x 2 pattern. By this means 
variability in both directions will be taken into account. There is no point 
in taking overlapping strata. Since each such stratum contributes 3 degrees 
of freedom the formula for the error variance per unit is 



where n' is the number of strata. 

In the case of line sampling a complication arises if all the lines are not of 
approximately equal length. If the total area covered by the sample is known, 
the most accurate estimate of the quantity under consideration will be obtained 
by the ratio method. In this case the calculation of the sampling error should 
strictly follow the method given in Section 7.9 for a stratified sample estimated 
by means of a constant ratio. With strata of two units the formula for Q 
becomes 

Q = ½ S(d_y^2) - 2r · ½ S(d_x d_y) + r^2 · ½ S(d_x^2) 



This will eliminate the variability due to variation in length of line. If the 
total area is not known, so that the final estimate is obtained by multiplication 
of the total over all the lines by the appropriate raising factor, the difference 
method given above, and not the ratio method, must be used. 

These methods of estimation of the sampling error are also applicable to 
line samples in which the lines are randomly located in pairs within blocks 
and thus form a proper stratified random sample. In this case the estimate 
of error will be fully valid. 

In either systematic or stratified random line sampling, the variation in 
the length between neighbouring lines will not be large unless the boundaries 
of the area covered are very irregular. Consequently the approximate method 
based on the direct differences can be used in most cases without serious 
inaccuracy. 

The above methods of estimation of error for systematic samples will give 
overestimates of the sampling error, provided there are no periodic features in 
the material, and provided in two-dimensional sampling that there are no 
marked strip effects running in straight lines across the material in such a 
manner that the whole of one line of sample points falls on the same strip. 
If a closer estimate is required, an alternative, but rather more complicated, 
procedure is available. In one-dimensional sampling, instead of taking successive 
differences, differences of the type 

d = ½y_1 - y_2 + y_3 - y_4 + y_5 - y_6 + y_7 - y_8 + ½y_9 

can be taken. Such differences may be called balanced differences. Most of 
the systematic component of variation is thus eliminated. The number of 
terms included in each difference is to a certain extent arbitrary, but 9 is 
chosen as a convenient compromise. With extensive material there will be 
no need to take overlapping differences, the best procedure being to have 
overlap of the end terms only, so that the y_9 of the first difference is taken as 
the y_1 of the second. With this convention the sum of all the differences is 
equal to one-half the first and last included terms plus the sum of all the 
remaining odd terms minus the sum of all the even terms. The square of each 
difference contributes one degree of freedom, the divisor being given by the 
sum of the squares of the coefficients, i.e. 7.5. Consequently s^2 = S(d^2)/7.5n'. 

FIG. 7.18 COEFFICIENTS FOR CALCULATING THE ERROR OF A SYSTEMATIC 
TWO-DIMENSIONAL SAMPLE 

A similar procedure can be followed in the case of two-dimensional 
systematic sampling, the most convenient type of difference being that given 
by the coefficients shown in Fig. 7.18. Here again, the margins of the square 
covering one difference may be taken as the margins of neighbouring squares. 
The divisor in this case will be 6¼. 
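
The one-dimensional balanced differences may be sketched in Python as follows. The sketch assumes nine-term differences with coefficients ½, -1, +1, ..., -1, ½, successive differences sharing their end terms only, and a divisor of 7.5 per difference; the function name and test data are hypothetical.

```python
COEFFICIENTS = [0.5, -1, 1, -1, 1, -1, 1, -1, 0.5]


def balanced_difference_variance(values):
    """Error variance per unit from nine-term balanced differences.

    Successive differences overlap in their end terms only, so each new
    difference starts eight units after the previous one.
    """
    diffs = []
    start = 0
    while start + len(COEFFICIENTS) <= len(values):
        window = values[start:start + len(COEFFICIENTS)]
        diffs.append(sum(c * v for c, v in zip(COEFFICIENTS, window)))
        start += len(COEFFICIENTS) - 1
    if not diffs:
        raise ValueError("at least nine values are needed")
    divisor = sum(c * c for c in COEFFICIENTS)  # 7.5
    return sum(d * d for d in diffs) / (divisor * len(diffs))


trend = list(range(17))                 # a pure linear trend, two differences
zigzag = [0, 1, 0, 1, 0, 1, 0, 1, 0]    # alternating values, one difference
print(balanced_difference_variance(trend), balanced_difference_variance(zigzag))
```

A pure linear trend gives zero, which illustrates how the balanced differences remove the systematic component that ordinary successive differences would count as error.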

The estimates provided by balanced differences will also in general be 
overestimates of the sampling error, but may be expected to be closer than 
those based on ordinary differences. If there is no wide discrepancy between 
the two types of estimate it may be concluded that the degree of overestimation 
is not likely to be great. More exact estimates can only be obtained by taking 
supplementary observations at intermediate points allocated either at random 
or systematically. The one-dimensional case has been discussed in detail 
by Yates (1948, A). 

The above methods of estimation can be applied both to quantitative and 
qualitative data, but in the case of qualitative data, based on either one- or 
two-dimensional point sampling, a rapid estimate of the sampling error can be 
made by using the formulas for a random sample, as in Example 7.15. This 
will tend to give greater overestimation of the sampling error than the above 
methods, but if the parts of the line or area possessing the attribute are small and 
irregularly distributed, with no great variation in density in different parts of the 
line or area, the estimate will be sufficiently good for most practical purposes. 

Example 7.18 

In the 1942 Census of Woodlands the total area of woodland shown on 
the maps was determined for each county by estimating the area of land 
coloured green on the 1-inch O.S. maps. This was done by measuring the 
total length of the E-W kilometre grid lines which fell in green areas. The 
results for O.S. sheet No. 115 covering part of Kent are given in Table 7.18. 
Estimate the sampling error of this process. 

TABLE 7.18 WOODLAND AREAS FROM LINE INTERCEPTS (cm.) 

Grid   Length    Length     Successive     Grid   Length    Length     Successive 
line   of line,  coloured   differences,   line   of line,  coloured   differences, 
       x         green, y   d_y                   x         green, y   d_y 

 98      3.5       0.0         -            83     30.0       3.8       +1.4 
 97      4.2       0.9       +0.9           82     29.4       4.1       +0.3 
 96      9.2       0.0       -0.9           81     29.1       4.9       +0.8 
 95     12.6       0.0        0.0           80     28.8       6.0       +1.1 
 94     15.5       0.3       +0.3           79     28.6       5.4       -0.6 
 93     21.2       0.1       -0.2           78     28.2       2.3       -3.1 
 92     25.2       0.5       +0.4           77     27.2       2.9       +0.6 
 91     25.4       3.1       +2.6           76     26.3       2.1       -0.8 
 90     31.2       2.8       -0.3           75     25.4       6.3       +4.2 
 89     34.2       2.7       -0.1           74     25.5       8.2       +1.9 
 88     34.1       2.8       +0.1           73     25.2       5.4       -2.8 
 87     33.0       2.6       -0.2           72     24.9       6.6       +1.2 
 86     31.4       2.3       -0.3           71     24.6       6.6        0.0 
 85     31.0       3.5       +1.2           70     20.8       4.1       -2.5 
 84     30.7       2.4       -1.1 

                                           Total  716.4      92.7 





The successive differences d_y of the lengths coloured green, y, are shown 
in the fourth and eighth columns. We find S(d_y^2) = 63.21 and consequently, 
since there are 29 lines, 

s^2 = ½ × 63.21/28 = 1.1288 
S.E.{S(y)} = √(29 s^2) = 5.72 

A length corresponding to 1 km. represents an area of 1 sq. km. 
Consequently the raising factor to be applied to the total length measured in 
cm. to give the estimated area in acres is 63,360 × 247.11/100,000 = 156.57. 

The total area of woodland is therefore 

156.57 × S(y) = 156.57 × 92.7 = 14,514 acres 

and the standard error of this area is 156.57 × 5.72 = 896 acres, i.e. 
6.2 per cent. 

The same procedure can be followed for the other maps covering the county, 
and the square root of the sum of the squares of the resultant standard errors 
will give the standard error for the whole county, since the errors of the different 
maps are virtually independent. The results for Kent gave a percentage 
standard error of 3.4 per cent. 

If the ratio method is used the corresponding successive differences d_x of 
the total lengths of the grid lines, and the sums of squares and products S(d_x^2) 
and S(d_x d_y) are required. The latter are found to be 159.23 and +2.62 
respectively for the map in question. Using the ratio r = 0.12940 derived 
from this map, we find 

s^2 = Q/28 = 1.1643 

As expected, there is no appreciable difference in the error calculated by 
the two methods. The simpler method is consequently all that is really 
required, even when the total area of woodland is calculated from the total 
area of the county and the ratio of the length coloured green to the total length 
of the grid lines. If, however, the first grid line is much shorter than the rest, 
owing to its cutting the map boundary at a small angle, it should be omitted, 
or the length made up by taking the relevant part of the line on the neighbouring 
map. This trouble will only arise if the error is estimated separately for each 
map and the grid lines are not exactly parallel to the map boundary. 

On the other hand, the calculation of the error from the total variance of 
y, i.e. without stratification, would give very misleading results. 
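
The whole of the calculation of Example 7.18 can be followed in the Python sketch below, which works from the line lengths of Table 7.18. The entries for grid line 84, which are damaged in this copy, are taken as 30.7 and 2.4, the values implied by the column totals and the neighbouring differences; the variable names are illustrative only.

```python
import math

# Grid lines 98 down to 70: length of line x and length coloured green y (cm.).
x = [3.5, 4.2, 9.2, 12.6, 15.5, 21.2, 25.2, 25.4, 31.2, 34.2, 34.1, 33.0, 31.4,
     31.0, 30.7, 30.0, 29.4, 29.1, 28.8, 28.6, 28.2, 27.2, 26.3, 25.4, 25.5,
     25.2, 24.9, 24.6, 20.8]
y = [0.0, 0.9, 0.0, 0.0, 0.3, 0.1, 0.5, 3.1, 2.8, 2.7, 2.8, 2.6, 2.3,
     3.5, 2.4, 3.8, 4.1, 4.9, 6.0, 5.4, 2.3, 2.9, 2.1, 6.3, 8.2,
     5.4, 6.6, 6.6, 4.1]

d_x = [b - a for a, b in zip(x, x[1:])]
d_y = [b - a for a, b in zip(y, y[1:])]
n_diff = len(d_y)  # 28 successive differences from 29 lines

sum_dy2 = sum(d * d for d in d_y)
sum_dx2 = sum(d * d for d in d_x)
sum_dxdy = sum(a * b for a, b in zip(d_x, d_y))

# Difference method: all successive differences, divisor 2 per difference.
s2_diff = sum_dy2 / (2 * n_diff)
se_total = math.sqrt(len(y) * s2_diff)

raising = 63360 * 247.11 / 100000   # cm. of green line to acres of woodland
area = raising * sum(y)
se_area = raising * se_total

# Ratio method: Q formed from the same differences with strata of two units.
r = sum(y) / sum(x)
q = 0.5 * sum_dy2 - 2 * r * 0.5 * sum_dxdy + r * r * 0.5 * sum_dx2
s2_ratio = q / n_diff

print(round(s2_diff, 4), round(se_area), round(s2_ratio, 4))
```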

7.19 Sampling on successive occasions 

It will be sufficient if we record the variances of the estimates given in 
Sections 6.21 and 6.22. 

(a) Two occasions only : sub-sample on the second occasion. 

V(ȳ_w) = {V(y) - μ b^2 V(x)}/λn = (1 - μ r^2) V(y)/λn 
V(ȳ_w - x̄) = {V(y) + (λ - 2λb - μ b^2) V(x)}/λn 

If the population value on the second occasion is estimated by adding the 
estimate of change derived from the sub-sample only to the overall mean on 
the first occasion, i.e. by ȳ' - x̄' + x̄, the variance of this estimate will be 

{V(y) - μ (2b - 1) V(x)}/λn 

(b) Two occasions only : part of the sample replaced on the second occasion. 

V(ȳ_w) = (1 - μ r^2) V(y)/{n (1 - μ^2 r^2)} 

or, in the case of unequal numbers on the two occasions, 



n i - 



With equal numbers on the two occasions, the variance of the estimate 
of change given by formula 6.21.b is approximately 

V(change) = (1 - r){V(y) + V(x)}/{n (1 - μ r)} 

The variance of the estimate given by the difference of the means of the 
units occurring on both occasions is approximately 

V(ȳ' - x̄') = (1 - r){V(y) + V(x)}/λn, 

and of that given by the difference of the overall means is approximately 

V(ȳ - x̄) = (1 - λr){V(y) + V(x)}/n 

The exact expressions in the last two cases are given by replacing r by 

2 cov(x, y)/{V(x) + V(y)} 

which is equal to r when V(x) = V(y). 

(c) Successive occasions : same fraction replaced on each occasion. 
The limiting value of V(ȳ_h), subject to the restrictions mentioned in 
Section 6.22, is as follows : 



The variance of the estimate of change given by ȳ_h - ȳ_(h-1) is 



Example 7.19.a 

Estimate the sampling errors of the estimates of Example 6.21. 

V(x) and V(y) may reasonably be taken as equal. The pooled estimate, 
based on all the observations on each occasion (22 degrees of freedom) is 
0.08767. 

We then have 

V(ȳ_w) = 0.08767 (1 - ⅓ × 0.847^2)/{12 (1 - 1/9 × 0.847^2)} = 0.0777^2 

This may be compared with the variance of the overall mean ȳ, which is 
0.08767/12, i.e. 0.0855^2. The ratio of these variances is 1.210. Thus the 
gain in efficiency by the use of the information provided by the sampling on 
the first occasion is 21 per cent. 

Similarly 

V(change) = 2 × 0.08767 (1 - 0.847)/{12 (1 - ⅓ × 0.847)} = 0.0558^2 

The variance of the change estimated from the units sampled on both occasions 
is 2 × 0.08767 (1 - 0.847)/8, i.e. 0.0579^2. The ratio of these variances is 
1.077, and the gain is therefore 8 per cent. If the change is estimated from 
the difference of the overall means the variance is 

V(ȳ - x̄) = 2 × 0.08767 (1 - ⅔ × 0.847)/12 = 0.0798^2 

The ratio of this variance to the first variance is 2.042, and the gain is therefore 
104 per cent. 

It will be noted that when V(x) is taken as equal to V(y) all the above 
gains depend solely on the values of λ and r. 
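
The gains quoted in Example 7.19.a can be checked with the Python sketch below. It assumes equal variances on the two occasions, 12 units on each occasion of which 8 are common (λ = ⅔, μ = ⅓), and the forms of the variance formulas implied by the arithmetic of the example; the variable names are illustrative.

```python
v = 0.08767        # pooled variance per unit, taken as equal on both occasions
n = 12             # units on each occasion
lam = 8 / 12       # fraction of units common to both occasions
mu = 1 - lam       # fraction replaced
r = 0.847          # correlation between occasions

# Mean on the second occasion: weighted estimate against the overall mean.
v_mean_weighted = v * (1 - mu * r ** 2) / (n * (1 - mu ** 2 * r ** 2))
v_mean_overall = v / n

# Change between occasions: three alternative estimates.
v_change_weighted = 2 * v * (1 - r) / (n * (1 - mu * r))
v_change_common = 2 * v * (1 - r) / (lam * n)
v_change_overall = 2 * v * (1 - lam * r) / n

gain_mean = v_mean_overall / v_mean_weighted
gain_over_common = v_change_common / v_change_weighted
gain_over_overall = v_change_overall / v_change_weighted

print(round(gain_mean, 3), round(gain_over_common, 3), round(gain_over_overall, 3))
```

Since v cancels in each ratio, the three gains depend only on the fraction common to both occasions and on r.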

Example 7.19.b 

Estimate the sampling errors of the estimates of Example 6.22. 

Owing to variation in the numbers on the different occasions the above 
formulas will only give approximate estimates of the sampling errors. Excluding 
January, the average number of observations per occasion is 9.8 and the average 
value of λ is 0.664. Since r = 0.811, 1 - r^2 = 0.343, and hence 

... = [-0.343 + √{0.343 (1 - 0.657 × 0.108)}]/(2 × 0.664 × 0.657) = ... 

V(y) was found to be 0.0871, and hence 



v * - * - d - 






7.20 The error graph 

In various instances in the preceding sections pooled estimates of the error 
variance have been used. Such pooled estimates are only legitimate if the 
error variance is reasonably constant over the parts of the population for which 
the pooling is carried out. In many types of material such constancy does not 
exist, and in such cases, when the number of degrees of freedom is too small 
for accurate determination of the error variances of the different parts of the 
population, other procedures must be followed if sampling errors are required 
separately for the different parts. 

The simplest and most convenient device for practical use is the error 
graph. The estimates of the error variance are plotted against some other 
characteristic of the parts of the sample which is believed to govern the 
magnitude of the error variance, and a smooth curve is drawn to fit the points 
as closely as possible. This curve gives the variance law from which revised 
estimates of the variance can be obtained for any given value of the determining 
characteristic. 

Fig. 7.20 shows a graph of this kind obtained in the course of a survey 
of wireworm infestation in grass fields. In each field 20 cores were taken and 
the wireworms counted in each core. There are thus 19 degrees of freedom 
for the determination of sampling error in each field. The estimates of the 
percentage variance so obtained were plotted against the estimated number 
of wireworms per acre in the various fields. The smooth curve so obtained 
was used to provide a table of errors that might be expected in similar sampling. 
Table 7.20 gives a small abstract of this table, and also of the inverse table, 
obtained by interpolation from the first table, giving the fiducial limits associated 
with any given observed number of wireworms. This procedure is approximate 
in a number of respects, but detailed discussion would be out of place here. 

TABLE 7.20 DISTRIBUTION AND PROBABLE LIMITS OF ERROR OF SAMPLE 
ESTIMATES OF WIREWORM POPULATIONS OF GRASS FIELDS SAMPLED BY TWENTY 
4 IN. CORES (1 CORE = 1/500,000 ACRES) 

(1,000 per acre) 

              One-eighth of sample                   Population which, in one-eighth 
              estimates                              of cases, would give an estimate 
True          less       greater      Estimated      not less than     not greater than 
population    than       than         population     that observed     that observed 

    200        105         295            200             128               326 
    400        260         540            400             284               567 
    600        428         772            600             451               804 
    800        597       1,003            800             624             1,040 
  1,000        766       1,234          1,000             797             1,277 


There are, of course, various other methods of dealing with problems 
of this type. In biological work the original variates are often transformed 
to other variates, such as logarithms or square roots, which may in the material 
in question be expected to have a more constant variance, and thus permit 
pooling of the estimates of error. Such procedures introduce a number of 
complications ; in particular the means of the transformed variates, when 
transformed back into the original variates, will be biased. They are not 
generally necessary or advisable in sample censuses and surveys. 



FIG. 7.20 STANDARD ERRORS PER UNIT CORE OF 4 IN. DIAM. 
(WIREWORM SURVEY, 1940-1) 

[Graph not reproduced. Horizontal axis: population, 1000 per acre (250 to 1250). Plotted: means for 2272 grass fields in 1940; means for 525 arable fields in 1940; curve fitted to data from grass fields; Poisson distribution.] 

Reproduced from Yates and Finney (1942, J) with the 
permission of the editors of the Annals of Applied Biology. 

7.21 Sub-sampling for the estimation of error 

In an extensive survey the calculation of the sampling error from the whole 
of the data would be very laborious, and would provide estimates of error 
which are unnecessarily accurate. In order to cut down the work a sub-sample 
of the whole of the material may be taken, or estimates of error may be calculated 
for certain parts of the survey only, e.g. certain strata, with or without 
sub-sampling. 

A convenient method of sub-sampling, which is applicable if there are a 
large number of separate strata of approximately equal size and a pooled 
estimate of the error variance per unit is required, is to select a random pair 
of units from each stratum, and to take the differences between the two 
members of each pair. In this case each difference d contributes one 
degree of freedom. If there are t differences the estimate of σ^2 is given by 

s^2 = S(d^2)/2t 

If the strata are few in number and of unequal size this method is not 
applicable, since the number of differences would be inadequate and the 
different strata would not be represented in proportion to their size. In general 
it is important to see that the contributions to the error variance from the 
different parts of the population are substantially the same in the sub-sample 
as they would be if the whole of the data of the original sample were used. 
For this reason the sub-sample should in general be obtained by the use of a 
uniform sampling fraction over the whole of the original sample. A systematic 
method of selection will usually be satisfactory. 

The taking of a sub-sample in this manner is somewhat troublesome, and 
also prevents accurate comparisons of the errors of parts of the survey which 
are in themselves small and therefore inadequately represented in the sub- 
sample. For these reasons the more convenient method of calculating the 
sampling error for certain parts of the population only is often employed. 
This procedure will lead to inaccuracies if the variability of the omitted portions 
is different from those that are included, but these inaccuracies can be reduced 
by selecting the parts to be included on a proper random basis. Thus in the 
1942 Census of Woodlands the sampling error was calculated by selecting 
two counties at random from each of the seven regions, the data of the first 
5 per cent. sample only being used. The surveyed quarter-sheets within each
of these counties, which were selected on a systematic grid pattern, were treated 
as if they were a random sample from all the sheets of the county. 

With grouped data the calculation of the sampling error from the whole 
of the data may well not present any appreciably greater labour than the use 
of a sub-sample, and in such cases the whole of the data will naturally be used. 

7.22 Rounding-off and grouping errors

If a constant grouping interval over the whole of the range is adopted,
the additional variance per unit introduced by the grouping is 1/12 of the square
of the grouping interval. If the true variance per unit is $\sigma^2$ and the grouping
interval is $a$, the total variance per unit of the grouped data is consequently
$\sigma^2 + a^2/12$, and the fractional loss of accuracy due to grouping is $a^2/(12\sigma^2)$.

Rounding-off is a form of grouping. In terms of units of the last place
that is retained, the variance per unit due to rounding-off is equal to 1/12.

If the variance is estimated from rounded-off or grouped data the additional 
variance due to grouping will be included in the estimate. If for any purpose 
an estimate of the true variance is required a deduction of 1/12 of the square of
the grouping interval must be made. This is known as Sheppard's correction. 

Grouping will also result in some loss of accuracy in the estimation of the 
variance. In a sample from a normal distribution the fractional loss of 
information due to this cause is $a^2/(6\sigma^2)$.

The above formulae are approximate, but provided the distribution is 
reasonably symmetrical they can be used in all cases ordinarily occurring in 
practice, in which $a$ is not likely to be greater than, and is usually considerably
less than, $\sigma$.

If the distribution is markedly skew, however, grouping may introduce 
a bias in the estimates which is not included in the above variance formulae. 
An extreme case is provided by distributions in which there are a large number 
of small values and only few large values, such as the distribution of acres 
of crops and grass, or of wheat, in the farms of Table 6.6.a. In this type of
distribution, if grouping is used, a smaller grouping interval must be employed
for the small values than for the large values, as in Example 7.2.b. If only
comparisons between similar estimates are required a coarser grouping may be 
adopted than if absolute values of the estimates are required, since in the 
former case any bias introduced will affect all estimates similarly. 

Example 7.22 

Determine the loss of accuracy in the estimation of the mean due to 
grouping in the data in Example 7.1.b.

The grouping interval is 300, and the standard deviation per unit (including
grouping errors) is 526. Their ratio is 0.571, and the fractional loss of
accuracy is therefore 1/12 × 0.571² or 2.7 per cent., i.e. equivalent to 4 of the
162 families in the sample.
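
The arithmetic of this example, together with Sheppard's correction, may be sketched as follows. The function names are hypothetical. Following the worked example, the standard deviation of 526 (which includes grouping errors) is used in the ratio; since the formulae are in any case approximate, this makes no material difference here.

```python
def grouping_loss(interval, standard_deviation):
    """Fractional loss of accuracy in the mean due to grouping, a^2 / (12 s^2)."""
    return (interval / standard_deviation) ** 2 / 12


def sheppard_corrected_variance(grouped_variance, interval):
    """Variance after deducting 1/12 of the square of the grouping interval."""
    return grouped_variance - interval ** 2 / 12


interval = 300
sd_grouped = 526      # standard deviation per unit, including grouping errors
families = 162

loss = grouping_loss(interval, sd_grouped)
equivalent_families = loss * families
corrected_variance = sheppard_corrected_variance(sd_grouped ** 2, interval)

print(f"fractional loss of accuracy: {100 * loss:.1f} per cent.")
print(f"equivalent number of families: {equivalent_families:.1f}")
print(f"variance after Sheppard's correction: {corrected_variance:.0f}")
```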

7.23 Determination of errors due to bias

As has been pointed out in Chapter 2, bias can arise either in the selection 
of the sample, or in the estimation process. 

Although biased methods of estimation can in general be avoided, occasions 
arise, such as that discussed in Example 7.17, where biased estimates are 
considerably more accurate than the corresponding unbiased estimates. The 
biased estimates are also sometimes considerably simpler to calculate than the 
unbiased estimates. For these reasons it is sometimes advisable to use biased
estimates for comparative purposes. Such estimates can clearly be used with
more confidence if the amount of bias that they introduce is not large. 

In general it is possible to make an estimate of the expected magnitude of 
a bias in estimation by a combination of mathematical analysis and detailed 
numerical analysis of the sampling results. Such methods, however, are 
complicated and vary with the type of sampling adopted. We shall therefore 
not describe them here. 

An alternative and relatively simple method is to compare the biased 
estimates with the corresponding unbiased estimates of the same quantity. 
Any one comparison will of course be affected by random sampling errors in 
both the estimates, but if the material is sufficiently extensive to provide a 
number of comparisons, the mean difference will provide an estimate of the 
average bias whose accuracy can be judged by the variation of the individual 
differences. 

Bias in the selection of the sample can in general only be assessed by 
comparison with another sample known to be free from bias. If, however, 
the distribution of some supplementary variate is known, bias in selection 
can sometimes be assessed, and if necessary eliminated, at least in part, by 
the use of regression. The calibration of eye estimates by means of regression, 
described in Section 6.15, provides an example of this procedure. As already 
pointed out, the procedure is subject to qualifications and cannot be relied on
to compensate for all possible sources of bias. The only certain guarantee 
that bias is absent is the use of methods of selection and observation which 
are free from bias. 

Example 7.23 

Assess the evidence for bias in the estimate of the dressing of nitrogen 
per acre derived from the unweighted mean over all fields of Example 6.19. 

The results already given in Tables 6.19.b and 6.19.c show that in each
size-group the unweighted mean dressing is less than the weighted mean.
The apparent bias in the overall unweighted mean dressing is 0.052. A
larger number of comparisons of the same type may be obtained by dividing
the 22 farms of the small-size group into two groups of 11 farms each, and
the 36 farms of the medium-size group into four groups of 9 farms each.
Division of Table 6.19.a into blocks in this manner gives the comparisons
shown in Table 7.23.

Six out of the seven differences are negative. The mean difference is
-0.040, and the standard error of this difference, estimated from the sum
of the squares of the deviations of the individual differences from their mean,
is ± 0.022. The evidence for bias on this small amount of data is therefore
not conclusive.
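
The mean difference and its standard error may be checked from the seven differences of Table 7.23. The sketch below assumes the usual estimate of variance with one degree of freedom fewer than the number of differences; the function name is hypothetical.

```python
import math

# Differences (unweighted mean minus weighted mean) from Table 7.23.
differences = [-0.037, -0.016, -0.047, 0.029, -0.033, -0.017, -0.160]


def mean_and_standard_error(values):
    """Mean of the values and the standard error of that mean."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((v - mean) ** 2 for v in values)
    variance = sum_of_squares / (n - 1)   # n - 1 degrees of freedom
    return mean, math.sqrt(variance / n)


mean_difference, standard_error = mean_and_standard_error(differences)
print(f"mean difference {mean_difference:.3f}, standard error {standard_error:.3f}")
```

The mean difference is less than twice its standard error, which is the sense in which the evidence for bias is not conclusive.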

The procedure has here been adapted to the data given in Table 6.19.a.
If the sampling were systematic the division of the groups should be made
by a random or systematic process such that each of the sub-groups constitutes
a substantially random sample of the whole of the group. The procedure is 
also subject to the qualification that any bias arising from differential weighting 
of the different size-groups will be excluded by this method of estimation, 
since the differences are based on comparisons within size-groups. 

TABLE 7.23 ESTIMATION OF BIAS

                 Weighted   Unweighted   Difference
                   mean        mean

Small farms       0.521       0.484       - 0.037
                  0.378       0.362       - 0.016

Medium farms      0.584       0.537       - 0.047
                  0.417       0.446       + 0.029
                  0.503       0.470       - 0.033
                  0.378       0.361       - 0.017

Large farms       0.576       0.416       - 0.160

An alternative method of subdivision which overcomes this limitation is 
to form sub-samples in which all the size-groups are represented in the correct 
proportions. If the data from a number of counties are available, no subdivision 
will be necessary, since the differences between size-group means and between 
the overall means for the different counties will provide all necessary 
comparisons. 

7.24 Interpenetrating samples : comparison of observers 

The error variance of the difference between two observers, estimated from 
interpenetrating samples, can be obtained by calculating the error variance 
appropriate to each observer, and adding these variances. This procedure, 
however, is subject to certain qualifications. In the first place the correction 
for finite sampling must not be applied. In the second place only those 
components of variance must be included which affect the comparisons between 
the observers. Thus, if a two-stage sampling process is adopted and each of 
the selected primary units is sampled by both observers, only the second-stage 
sampling error will enter into the comparison between observers. 

If the data relevant to the comparison between two observers are at all 
extensive it is often possible to make a direct estimate of the error of this 
comparison by subdividing the material so that a number of independent 
differences are obtained, in the same manner as in the estimation of bias from 
different methods of estimation. Thus, with the above two-stage sampling 
process the difference between the observers might be obtained for each 
primary unit separately.

7.25 Estimation of the sampling error from duplicate samples

If a survey is carried out in two or more interpenetrating parts and the 
results are tabulated separately, an estimate of the sampling error can be 
obtained from the differences of the two samples. For such an estimate to be 
of any value there must be a number of independent differences, so that at 
least a moderate number of degrees of freedom are available. Even with 
extensive surveys the number of available differences is likely to be small, 
so that such estimates are usually rather rough. Nevertheless they are useful 
when the detailed results are not available. 

If the two samples are distinguished by single and double dashes, and the
estimate of the population total is given by the sum of the t parts 1, 2, etc.,
we have $Y' = Y_1' + Y_2' + \ldots$, and $Y'' = Y_1'' + Y_2'' + \ldots$ If the sizes
of the two samples are in the ratio $\lambda : \mu$, where $\lambda + \mu = 1$, the estimate $Y$
of the population total from the two samples is $\lambda Y' + \mu Y''$, with similar
expressions for $Y_1$, etc. An unbiased estimate of the error of $Y$ is given by

$$V(Y) = \lambda\mu (1 - f) \{(Y_1' - Y_1'')^2 + (Y_2' - Y_2'')^2 + \ldots\}$$

where $f$ is the sampling fraction for the whole survey. When the parts vary
considerably in size this estimate is very inefficient, since excessive weight is 
given to the larger totals. If the approximate relation between the variances 
of the totals of the parts is known, a more efficient estimate can be obtained, 
though this will be biased if the assumed law of variance is incorrect. If the 
variances are proportional to $Y_1$, $Y_2$, etc., the efficient estimate is

$$V(Y) = \frac{\lambda\mu (1 - f)\, Y}{t} \{(Y_1' - Y_1'')^2 / Y_1 + (Y_2' - Y_2'')^2 / Y_2 + \ldots\}$$

This law of variance is likely to be approximately true for area surveys if the 
density per unit area of the quantity surveyed does not vary very greatly from 
part to part. 

If the variances are proportional to some other quantity, such as the number 
of units in each part, these numbers must be substituted for $Y_1$, $Y_2$, etc. and
their sum for $Y$ in the above formula.

Example 7.25 

In the 1942 Census of Woodlands the total volumes of timber for the seven 
regions of the survey obtained from the first and second 5 per cent. samples,
excluding areas surveyed in 1938-9, and with allowance for felling in the 
interval between the two samples, are shown in Table 7.25. Estimate the 
sampling error to which the combined estimate of the total volume of timber 
for the country is subject.

TABLE 7.25 VOLUMES OF TIMBER IN THE DIFFERENT REGIONS ESTIMATED FROM
THE FIRST AND SECOND 5 PER CENT. SAMPLES OF THE 1942 CENSUS OF
WOODLANDS

                 Volume, m. cu. ft.
Region    Sample A   Sample B    Mean     A - B    (A - B)^2 / Mean

A            253        201       227      + 52         11.9
B             74         84        79      - 10          1.3
C            100        107       104      -  7          0.5
D            148        164       156      - 16          1.6
E            209        227       218      - 18          1.5
F             94         78        86      + 16          3.0
G            112        119       116      -  7          0.4

Total        990        980       986      + 10         20.2

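The computations are shown in the last three columns. The sum of
squares of the differences A - B is 3738, and therefore from the first formula
the standard error of the total is

$$\sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{9}{10} \times 3738} = 29.0 \text{ m. cu. ft.}$$

The sum of the last column is 20.2, and therefore by the second formula the
standard error is

$$\sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{9}{10} \times \tfrac{1}{7} \times 986 \times 20.2} = 25.3 \text{ m. cu. ft.}$$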



It will be noted that an estimate of the standard error of any regional total 
can be obtained directly from the second formula by substituting this total 
for the grand total 986. 

The above estimates are very rough, since they are based on only 7 degrees 
of freedom. The estimate obtained by the method described in Section 7.18 
is ± 19.7 m. cu. ft. With a more extensive set of comparisons between the
two samples a more accurate estimate could be obtained. 
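
Both estimates of this example may be reproduced by the following sketch. It assumes two samples of equal size (so that the two weights are each one-half) and an overall sampling fraction of 10 per cent.; the function name is hypothetical. The part totals are taken as unrounded means of the two samples, so that the second figure differs very slightly from that obtained from the rounded means of Table 7.25.

```python
import math

# Regional volumes (m. cu. ft.) from Table 7.25, regions A to G.
sample_a = [253, 74, 100, 148, 209, 94, 112]
sample_b = [201, 84, 107, 164, 227, 78, 119]


def duplicate_sample_errors(first, second, weight_first, sampling_fraction):
    """Standard errors of the combined total from two interpenetrating samples.

    Returns (unweighted estimate, estimate assuming the variances of the
    part totals are proportional to the part totals).
    """
    weight_second = 1 - weight_first
    t = len(first)
    parts = [weight_first * a + weight_second * b for a, b in zip(first, second)]
    total = sum(parts)
    squares = [(a - b) ** 2 for a, b in zip(first, second)]
    factor = weight_first * weight_second * (1 - sampling_fraction)
    v_unweighted = factor * sum(squares)
    v_proportional = factor * total / t * sum(s / y for s, y in zip(squares, parts))
    return math.sqrt(v_unweighted), math.sqrt(v_proportional)


se_first, se_second = duplicate_sample_errors(sample_a, sample_b, 0.5, 0.10)
print(f"first formula: {se_first:.1f}, second formula: {se_second:.1f} m. cu. ft.")
```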

7.26 Presentation of sampling errors in extensive surveys 

In the preceding sections the methods of estimating the sampling error 
of single estimates, e.g. of the population mean, have been described. In 
extensive surveys the results will usually be broken down in various ways,
so that the final tables will contain a large number of estimates relevant to
different parts of the population. The errors of these different estimates could 
be calculated separately by the methods already given, using pooled estimates 
of the error variance per unit where appropriate, but such a procedure would 
be laborious, and even if it were carried out, the presentation of the separate 
standard errors in the tables of results would make these tables somewhat 
confused. 

From every point of view, therefore, it is desirable to have some condensed 
method of presenting the errors of the component parts of elaborate tables. 
This can be effected by making use of some error law, either theoretical or 
empirical, which will enable the standard error of any particular component to 
be rapidly obtained when required from other information available in the 
table. The exact form of the law will depend on the type of material and the 
nature of the information tabulated. 

The simplest case is that in which the tables contain means of a quantitative 
variate derived from a survey with constant sampling fraction, and the error 
variance per sampling unit is constant. The standard error of any mean then 
depends only on the value of the sampling fraction, the error variance, and the 
number of units on which the mean is based. These numbers are likely in 
any case to be of interest in themselves and to be presented either directly or 
in raised form as estimates of the numbers of units in the different parts of the 
population. Alternatively, if the actual numbers of units in the different parts 
are known these may be presented. 

Whatever the exact form of presentation an auxiliary table can easily be 
prepared by the use of formula 7.2, or its analogue for stratified sampling, 
giving the standard errors corresponding to different numbers in the population 
or sample. If a table is felt to be unduly elaborate the formula on which it is 
based may be presented. If the standard errors are likely to be used mainly 
for testing the differences between different means the correction for finite 
sampling can be omitted, with of course a note to this effect. It may be noted 
that with a table of this type the standard error of the difference of two means
based on numbers $n_1$ and $n_2$ can be obtained by entering the table with the
number $n'$, given by $1/n' = 1/n_1 + 1/n_2$.
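
The construction of such an auxiliary table, and its use for the difference of two means, may be sketched as follows. The sketch assumes that the standard error of a mean takes the form $\sqrt{s^2 (1 - f)/n}$; the error variance per unit and the numbers of units are hypothetical, and the correction for finite sampling is omitted, as suggested above for comparisons between means.

```python
import math


def standard_error_of_mean(variance_per_unit, n, sampling_fraction=0.0):
    """Standard error of a mean of n units with constant error variance per unit."""
    return math.sqrt(variance_per_unit * (1 - sampling_fraction) / n)


def equivalent_number(n1, n2):
    """Number n' with which to enter the table for a difference of two means."""
    return 1 / (1 / n1 + 1 / n2)


variance_per_unit = 400.0          # hypothetical error variance per unit
auxiliary_table = {
    n: standard_error_of_mean(variance_per_unit, n) for n in (25, 100, 400)
}

n_prime = equivalent_number(100, 400)
se_difference = standard_error_of_mean(variance_per_unit, n_prime)

for n, se in auxiliary_table.items():
    print(f"n = {n:4d}   S.E. = {se:.2f}")
print(f"n' = {n_prime:.0f}   S.E. of difference = {se_difference:.2f}")
```

Entering the table with $n'$ gives the same result as combining the two separate standard errors by the square root of the sum of their squares.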

Another simple case is that in which the summary table relates to qualitative 
data based on the proportion of units possessing a given attribute. If the 
sample is random the standard error of any entry depends only on the sampling 
fraction, the proportion of units possessing the given attribute in the part of 
the population under consideration, and the number of units. If the variation 
in the proportion of units is not large in the different parts of the population, 
a table or formula based on the proportion in the whole population may be 
sufficiently accurate. If the proportion of units possessing the attribute is 
small in all parts, q can be taken equal to unity. 

An example of the use of an empirical law is provided by the case in which 
the error variances are approximately proportional to the magnitude of the 
totals of the different parts, which, as pointed out in the last section, is likely
to hold for certain types of area survey. In this case a formula or table relating
the errors to the entries of a table of totals can be presented. 

There is not space here to discuss more complicated cases, which must 
be dealt with on their merits as they arise. With the more elaborate types 
of sampling the possibilities for presenting the standard errors in the form 
of auxiliary tables are more limited, but even in such cases it is often possible 
to summarize the standard errors in the form of a few relatively simple formulae, 
suitable for rapid calculation on a slide rule. 






CHAPTER 8 
EFFICIENCY 

8.1 General remarks

The methods described in Chapter 7 enable the sampling error associated 
with a sample of a given type and size to be calculated from the data furnished 
by the sample itself. When planning a sample census or survey, we have 
to solve the more general problem of calculating the sampling errors of samples 
of various types and sizes from the data furnished by a sample of a particular 
type and size. We can then determine which method of sampling is likely 
to be most efficient and the size of sample necessary to give the required
accuracy.

The determination of the sample size in the case of a random sample from 
a large population has already been discussed in Section 4.31. It was there
shown that, for qualitative characters which are attributes of the sampling
units, the number of units required could be determined without any prior
knowledge of the material other than the approximate proportion of units
possessing the given attribute in the population ; and that for quantitative 
characters knowledge of the standard deviation of the character in question 
per sampling unit was all that was required. 

The formulae of Section 4.31 apply when the population is large
relative to the size of sample required. If the population is not large a
correction must be made to allow for finite sampling. This is most simply
done by calculating the number of units $n_0$ that would be required if the
population were large, and the corresponding sampling fraction $f_0$.
The required sampling fraction is then given by

$$f = \frac{f_0}{1 + f_0} \qquad (8.1)$$

In this calculation $f_0$ may be greater than unity.
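
The correction for finite sampling may be sketched as follows. The form used, $f = f_0/(1 + f_0)$, is that which agrees with the figures of Example 8.2.b, and follows if the sampling variance is proportional to $(1 - f)/n$. The function name and the numbers are hypothetical.

```python
def finite_population_sample(n_large, population_size):
    """Sampling fraction and sample size after allowing for finite sampling.

    n_large is the number of units that would be required if the population
    were large; f0 = n_large / population_size may exceed unity.
    """
    f0 = n_large / population_size
    f = f0 / (1 + f0)
    return f, f * population_size


# Hypothetical case: 500 units needed from a large population, but N = 2000.
fraction, size = finite_population_sample(500, 2000)

# Hypothetical case in which f0 exceeds unity.
fraction_big, size_big = finite_population_sample(3000, 2000)

print(fraction, size)          # required fraction and number of units
print(fraction_big, size_big)
```

In the first case the corrected sample of 400 units satisfies $1/n - 1/N = 1/n_0$, so that its sampling variance equals that of 500 units drawn from a large population.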

The method followed in Section 4.31, i.e. that of taking the appropriate 
formula for the standard error of a sample of size n and rewriting this 
formula to give an equation for n, is a general one and can be applied 
to the more complicated types of sample, using the appropriate formulae 
for the standard errors given in Chapter 7. It is apparent, however, that 
these formulae can only be used if the relevant variances per sampling
unit are known or can be estimated. In certain cases, also, the formulae
cannot conveniently be rearranged so as to give n directly. This, however, 
is a minor point, since the required solution can always be quickly found by 
trial once estimates of the relevant variances are available. 

In the following sections we will discuss the problems that arise in the 
estimation of the variances relevant to different types of sample when the 
basic data consist of a sample of a different type. In certain cases data relating
to all the units of a population will be available. This situation does not differ
in any essential particulars from that in which data are derived from a random 
sample of the population. 

We will here define the sense in which we shall use various terms in the 
subsequent discussion. 

The relative accuracy of two samples which differ in respect of method 
of sampling or size of sample, or both, may be defined as the reciprocal of the 
ratio of the sampling variances of the estimates provided by them. 

The relative precision of two different methods of sampling based on the 
same type of sampling unit may be defined as the reciprocal of the ratio of the 
sampling variances of the estimates given by the two methods when the same 
number of units are taken. 

The relative efficiency of two different methods of sampling based on the 
same type of sampling unit may be defined as the reciprocal of the ratio of the 
numbers of units required to attain a given accuracy with the two methods. 

In the case of a random sample from a large population, or a stratified 
sample with fixed strata from such a population, the relative efficiency is equal 
to the relative precision. But if the size of the strata depends on the number 
of units in the sample, or if the population is not large relative to the size of the 
sample, there is a difference between the two concepts. 

The term efficiency is already in current use in the theory of estimation. 
It is there used in an absolute sense. An estimate is efficient (i.e. has an efficiency 
of 100 per cent.) if in large samples it is one of the class of most accurate 
estimates, i.e. estimates with minimum variance. An estimate has an efficiency 
of x per cent. if it has 100/x times this minimum variance. This use of the
term is analogous to precision in our terminology. The reason why no distinction 
has to be made between precision and efficiency in the theory of estimation is 
that only large populations are normally under consideration, in which case 
the two concepts are synonymous. Since no confusion is likely to arise, we 
shall continue to use the term efficiency when discussing the relative accuracy 
of different estimates derived from the same sample. 

The concepts of relative precision and relative efficiency may be extended 
to cover methods of sampling based on different types of sampling unit, by 
replacing numbers of units by the amount of material included in the sample. 
They may be further extended to cover the relative accuracy for a given cost 
and the relative cost for a given accuracy. 

It may be noted here that the relative precision and relative efficiency of 
different types of sampling should as far as possible be judged from estimates 
of the sampling variances derived from the same set of data. Comparisons 
based on estimates derived from independent samples of different types are 
subject to errors of estimation which are considerably larger, and comparisons 
based on samples from different aggregates of similar material are even more 
subject to uncertainty. No very general conclusions should, however, be 
drawn from a single comparison based on a small amount of data, even when 
a single set of data is used. The relative precision of stratified and random
samples, for instance, will depend on the differences between strata, and these
differences may vary considerably even in apparently similar material. 

8.2 Qualitative data 

If the variates under consideration are attributes of the sampling units, the 
effect of stratification, with either uniform or variable sampling fraction, can 
be determined from a knowledge of the proportions of units possessing the 
given attribute in the different strata. In other cases qualitative variates must be 
treated similarly to quantitative variates, as in the estimation of sampling errors. 

Formulas for the required size of a stratified random sample with uniform 
sampling fraction, analogous to those for a random sample given in Section 4.31, 
can be written down without difficulty. A somewhat simpler approach, however, 
is to estimate the percentage standard error of a stratified sample of any 
convenient size (e.g. the size of the sample of which the data are available) 
on the assumption that the population is large. The size of sample required 
to give any predetermined percentage standard error is then given, if the 
population is large, by the formula 

$$\frac{\text{Size of sample required}}{\text{Size of actual sample}} = \frac{(\text{Actual percentage standard error})^2}{(\text{Required percentage standard error})^2}$$

Allowance for the effect of finite population size can then be made by 
formula 8.1. 
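
The scaling of the sample size by the square of the ratio of the percentage standard errors may be sketched as follows. The function name is hypothetical; the figures are those of the stratified sample in Example 8.2.b (125 farms, a percentage standard error of 6.80 on the assumption of a large population, and a required standard error of 5 per cent.).

```python
def required_sample_size(actual_size, actual_pct_se, required_pct_se):
    """Size of sample required, the population being assumed large."""
    return actual_size * (actual_pct_se / required_pct_se) ** 2


n_large = required_sample_size(125, 6.80, 5.0)
print(f"{n_large:.1f} units before allowance for finite population size")
```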

In the case of a stratified random sample with variable sampling fraction 
the same procedure can be followed, with the exception that allowance for the 
effect of finite population size cannot be made in the above manner. If, 
therefore, any of the correction factors $(1 - f_i)$ are sufficiently large to be of
importance, the approximate size of sample required may first be calculated 
as above and the final size found by trial. Variable sampling fractions, however, 
are not likely to be much used for qualitative data. 

Example 8.2.a

If a large population of individuals is divided into five strata containing 
equal numbers of people, determine the relative sizes of a stratified and a fully 
random sample of the same accuracy when the percentages of individuals 
giving a positive answer to a given question in the different strata are (a) 70, 
60, 50, 40 and 30 per cent., (b) 10, 7½, 5, 2½ and 0 per cent.

A sample of 500 people will have 100 in each stratum. The variance of
the number in the sample giving positive answers will be, in case (a), for a
stratified sample,

$$100 \times 0.7 \times 0.3 + 100 \times 0.6 \times 0.4 + 100 \times 0.5 \times 0.5 + 100 \times 0.4 \times 0.6 + 100 \times 0.3 \times 0.7 = 115$$

and for a random sample,

$$500 \times 0.5 \times 0.5 = 125$$

The ratio of the required sizes is therefore 125/115 = 1.087, i.e. the random
sample will have to be 8.7 per cent. larger. In case (b) a similar calculation
shows the random sample will have to be 2.7 per cent. larger.
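
The two cases of this example may be verified by a short calculation. The sketch assumes strata of equal size and a large population, as in the example; the function name is hypothetical.

```python
def required_size_ratio(percentages):
    """Ratio of the sizes of random and stratified samples of equal accuracy.

    The strata are assumed to be of equal size, so the variance per unit of a
    stratified sample is the simple average of p(1 - p) over the strata.
    """
    proportions = [p / 100 for p in percentages]
    k = len(proportions)
    stratified = sum(p * (1 - p) for p in proportions) / k
    overall = sum(proportions) / k
    random_sample = overall * (1 - overall)
    return random_sample / stratified


ratio_a = required_size_ratio([70, 60, 50, 40, 30])
ratio_b = required_size_ratio([10, 7.5, 5, 2.5, 0])
print(f"case (a): {100 * (ratio_a - 1):.1f} per cent. larger")
print(f"case (b): {100 * (ratio_b - 1):.1f} per cent. larger")
```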

Example 8.2.b 

Determine from the data of Examples 6.5 and 7.6.a the numbers of
farms required to give a sampling standard error of 5 per cent. in the estimate
of the number of farms growing wheat (a) when the sample is random, (b) when 
it is stratified by size-groups. 

(a) We have p = 54/125 = 0.432. Consequently, from Sections 4.31
and 8.1

$$n_0 = \frac{10{,}000 \times 0.568}{0.432 \times 5^2} = 526$$

$$f_0 = 526/2496 = 0.210$$

$$f = \frac{0.210}{1 + 0.210} = 0.1736$$

$$n = 433$$

(b) We have already found that for the stratified sample U = 1080 and
S.E. (U) = ± 71.64. If the population had been large, therefore, we should
have had S.E. (U) = $71.64/\sqrt{1 - 1/20} = 73.5$. Consequently in this case
S.E. % (U) = 6.80. Hence

$$\frac{n_0}{125} = \frac{6.80^2}{5^2}$$

$$n_0 = 231$$

$$f_0 = 231/2496 = 0.0927$$

$$f = 0.0927/(1 + 0.0927) = 0.0848$$

$$n = 212$$

The standard error of the total estimated from a random sample of
125 farms is $20\sqrt{125 \times 0.432 \times 0.568 \times 19/20} = 108.0$. Consequently the
relative precision of the stratified and random samples of 125 units (or indeed
of any number of units) is given by $108.0^2/71.64^2 = 2.27$. The relative
efficiency, when a 5 per cent. standard error is required, however, is
433/212 = 2.05. The relative efficiency is slightly less than the relative
precision because we are sampling from a finite population.
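
The whole of this example may be retraced by the sketch below, which uses only the figures quoted in the example (a population of 2496 farms, a sample of 125 of which 54 grow wheat, a raising factor of 20, and U = 1080 with standard error 71.64 for the stratified sample). The variable and function names are hypothetical. The arithmetic is carried through without intermediate rounding, so that the numbers of farms may differ by a unit or so from the figures printed above.

```python
import math

POPULATION = 2496        # farms in the population
SAMPLE = 125             # farms in the sample
RAISING_FACTOR = 20      # reciprocal of the sampling fraction
REQUIRED_PCT_SE = 5.0


def finite_size(n_large, population):
    """Sample size after allowance for finite sampling, f = f0 / (1 + f0)."""
    f0 = n_large / population
    return population * f0 / (1 + f0)


# (a) Random sample: percentage standard error of a proportion p.
p = 54 / SAMPLE
n0_random = 10_000 * (1 - p) / (p * REQUIRED_PCT_SE ** 2)
n_random = finite_size(n0_random, POPULATION)

# (b) Stratified sample: remove the finite sampling correction, then rescale.
total, se_total = 1080, 71.64
se_large = se_total / math.sqrt(1 - 1 / RAISING_FACTOR)
pct_se = 100 * se_large / total
n0_stratified = SAMPLE * (pct_se / REQUIRED_PCT_SE) ** 2
n_stratified = finite_size(n0_stratified, POPULATION)

# Relative precision (same number of units) and relative efficiency.
se_random_total = RAISING_FACTOR * math.sqrt(
    SAMPLE * p * (1 - p) * (1 - 1 / RAISING_FACTOR)
)
relative_precision = (se_random_total / se_total) ** 2
relative_efficiency = n_random / n_stratified

print(f"random: {n_random:.0f} farms, stratified: {n_stratified:.0f} farms")
print(f"relative precision {relative_precision:.2f}, "
      f"relative efficiency {relative_efficiency:.2f}")
```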

8.3 Random sample and stratified sample with uniform sampling 
fraction 

The general principle to be followed is to construct an analysis of variance 
which corresponds as closely as possible to that appropriate to the required
type of sample. The procedure varies somewhat according to the type of data
available. 

(a) From the data of a stratified sample with uniform sampling fraction : 

An analysis of variance within and between strata in the form of Table
7.7.a must be made. The within-strata mean square $s_1^2$ gives an estimate
of the error variance per unit in a stratified sample, and the mean square $s^2$
from the total line gives a similar estimate for a random sample. If separate
estimates of the error variance per unit have been made for the different strata,
as in Example 7.6.a, a pooled within-strata sum of squares may be calculated
by multiplying the within-strata error variances by the degrees of freedom
$n_i - 1$ for each stratum, and summing the products, or by summing the sums
of squares directly.

The formula of Section 4.31 can then be used to determine the size of
sample, using $s_1^2$ in place of $s^2$ for a stratified sample, and correcting for finite
population in the same manner as in Section 8.1. Since for a stratified sample
$V(\bar{y}) = s_1^2 (1 - f)/n$, and for a random sample $V(\bar{y}) = s^2 (1 - f)/n$, the
relative precision of stratified and random sampling will be given by the ratio
$s^2/s_1^2$. The relative efficiency will be somewhat less than the relative precision
when the corrections for finite sampling are appreciable.

This procedure is approximate in two respects. In the first place, if the 
variances within the different strata are unequal they do not enter into the 
mean square B with quite the correct weights, as already explained in Section 7.7.
In the second place, a stratified sample has a slightly greater overall variance 
per unit than a random sample from the same population, and consequently 
C is not the best estimate of the variance per unit of a random sample. Neither 
of these approximations gives rise to errors of any importance in the comparison 
of a random and a stratified sample, but it may be noted that the bias in C 
can be almost completely eliminated by calculating $s^2$ from the formula



(8.3) 

An extension of this formula is of use in the case of multiple stratification
(Section 8.4). Method (c) below takes account of both sources of disturbance.

(b) From the data of a random sample :

An analysis of variance within and between strata can be made in the same 
manner as with a stratified sample with uniform sampling fraction, and $s_1^2$
and $s^2$ can be estimated as in a stratified sample.

For this procedure it is only necessary that the units of the sample be 
classified by strata. The numbers of units of the whole population falling in 
the different strata do not require to be known. 

If these numbers are known, Method (c) below can be followed. This 
will give slightly more accurate results at the cost of a little additional 
computation, since allowance is made for the fact that the numbers in the
different strata in the sample will not be exactly proportional to the numbers
in the population, owing to the fluctuations of random sampling. 

(c) From the data of a stratified sample with a variable sampling fraction 
(or any arbitrary values of the sampling fractions) : 

Estimates of the average within-strata mean square $s_1^2$ and of the overall
mean square $s^2$ must be calculated from the proportions $h_i = N_i/N$ of the units
of the population in the different strata. The formulae are

$$s_1^2 = \sum h_i s_i^2$$

$$s^2 = s_1^2 + \sum h_i \bar{y}_i^2 - \bar{y}^2 - \sum h_i (1 - h_i) s_i^2 / n_i$$

where $\bar{y}$ is the estimate of the population mean derived from the sample, and
is consequently equal to $\sum h_i \bar{y}_i$. The relation of these formulae to the analysis
of variance of a stratified sample with uniform sampling fraction will be
apparent. The terms involving $\bar{y}$ in $s^2$ correspond to the between-strata
component of variance, the last term of $s^2$ being the correction required because
the $\bar{y}_i$ are themselves subject to sampling error. This correction will be trivial
except when the between-strata component of variance is small and there are
a large number of strata with few units from each stratum. If the $h_i$ are put
equal to $n_i/n$ (uniform sampling fraction), $s^2$ will be the same, to order $1/n$,
as that given by the mean square C of Method (a), with the exception that in
Method (a) $\sum h_i \bar{y}_i^2 - \bar{y}^2$ is multiplied by a factor $n/(n - 1)$.

It will be noted that the data need not be derived from a sample in which 
the sampling fractions are chosen with the object of obtaining the most accurate 
possible estimates : any set of data in which the sampling is random within 
strata, and from which the proportions of the units in the different strata, 
the strata means and the within-strata variances can be determined with sufficient 
accuracy, will be adequate. 

Example 8.3.a 

Determine the error variances per unit and the relative precision of a stratified 
random sample with uniform sampling fraction and a fully random sample 
from the data on wheat acreages of the stratified random sample of Hertfordshire 
farms (Examples 6.5 and 7.6.a). 

The analysis of variance is given in Table 8.3.a. The within-strata sum 
of squares is obtained directly from Table 7.6.a by summing the column 
$S_i (y - \bar{y}_i)^2$. The between-strata sum of squares is obtained by summing 
the products of the columns of Table 6.5.b giving the totals and means, and 
deducting the product of the general total and the general mean. These 
means should be taken to two and three decimal places respectively. We 
thus have $s_1^2 = 349.2$ and $s^2 = 797.1$. The estimate of the relative precision 
is therefore $797.1/349.2 = 2.28$. 
TABLE 8.3.a ANALYSIS OF VARIANCE OF THE STRATIFIED RANDOM SAMPLE OF HERTFORDSHIRE FARMS 

                          Degrees        Sum          Mean 
                          of freedom     of squares   square 
Between size-groups            5          57,278 
Within size-groups           119          41,558       349.2 
Whole sample                 124          98,836       797.1 



Example 8.3.b 

Make similar estimates to those of Example 8.3.a, using the data of the 
random sample (Examples 6.6, 7.2.b and 7.6.b). 

The analysis of variance is given in Table 8.3.b. If the $N_i$ are not 
known, we have $s_1^2 = 488.5$ and $s^2 = 1329.9$. The estimate of the relative 
precision is therefore $1329.9/488.5 = 2.72$. 

TABLE 8.3.b ANALYSIS OF VARIANCE OF THE RANDOM SAMPLE OF HERTFORDSHIRE FARMS 

                          Degrees        Sum          Mean 
                          of freedom     of squares   square 
Between size-groups            5         106,775 
Within size-groups           119          58,129       488.5 
Whole sample                 124         164,904     1,329.9 



If the $N_i$ are known the calculations follow the same lines as those of 
Example 8.3.c below, and are left to the reader. In this case we find $s_1^2 = 436.5$, 
$s^2 = 1189.2$, the estimate of the relative precision being again 2.72. 
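
The arithmetic of Examples 8.3.a and 8.3.b can be checked with a few lines of Python. This is only a sketch: the function name `relative_precision` is a label chosen here for illustration, and the sums of squares and degrees of freedom are those printed in Tables 8.3.a and 8.3.b.

```python
def relative_precision(ss_within, df_within, ss_total, df_total):
    """Return (within-strata mean square, whole-sample mean square, their ratio)."""
    s1_sq = ss_within / df_within   # average within-strata mean square
    s_sq = ss_total / df_total      # whole-sample mean square
    return s1_sq, s_sq, s_sq / s1_sq


# Table 8.3.a: stratified random sample with uniform sampling fraction
stratified = relative_precision(41_558, 119, 98_836, 124)
# Table 8.3.b: random sample classified by size-groups (N_i not used)
random_sample = relative_precision(58_129, 119, 164_904, 124)

for label, (s1_sq, s_sq, ratio) in (("stratified", stratified),
                                    ("random", random_sample)):
    print(f"{label}: s1^2 = {s1_sq:.1f}, s^2 = {s_sq:.1f}, ratio = {ratio:.2f}")
```

The ratio is simply the whole-sample mean square divided by the within-strata mean square, so any analysis of variance laid out as in these tables can be passed through the same function.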



Example 8.3.c 

Make similar estimates to those of Example 8.3.a, using the data of the 
sample with variable sampling fraction (Examples 6.7 and 7.7.a). 

Table 8.3.c shows the calculations. The $h_i$ are calculated from the numbers 
in the population. These are given in Table 6.6.b, except for the last two 
size-groups, which have the values 215 and 51 respectively. It will be noted 
that we are here considering a sample stratified for districts as well as size-groups. 
TABLE 8.3.c CALCULATION OF THE AVERAGE WITHIN-STRATA AND OVERALL MEAN SQUARES FROM THE STRATIFIED SAMPLE WITH VARIABLE SAMPLING FRACTION OF HERTFORDSHIRE FARMS 

Size-group   $h_i$    $n_i$   $\bar{y}_i$   $\bar{y}_i^2$   $s_i^2$    $h_i(1-h_i)s_i^2/n_i$ 

  1-  5      0.174      0      (0)        (0)       (0)       (0) 
  6- 20      0.208      3       0          0         0         0 
 21- 50      0.143      6      4.5        20        53.5      1.1 
 51-150      0.208     26      8.2        67       159.2      1.0 
151-300      0.160     40     29.1       847       564.2      1.9 
301-500      0.086     43     76.6     5,868     1,703        3.1 
501-         0.020     17    172.1    29,618     2,614        3.0 

             0.999    135     17.03    1,249.3     329.8     10.1 

301-500      0.811     43     76.6     5,868     1,703        6.1 
501-         0.189     17    172.1    29,618     2,614       23.5 

             1.000     60     94.65   10,357     1,875       29.6 

301-         0.106     60     94.6     8,959     3,243        5.1 

             0.999    135     17.03    1,102.0     474.8      9.1 



The sums of the products of $h_i$ with $\bar{y}_i$, $\bar{y}_i^2$ and $s_i^2$ are shown at the foot 
of their respective columns. We therefore have, since $17.03^2 = 290.0$, 

$s_1^2 = 329.8$ 
$s^2 = 329.8 + 1249.3 - 290.0 - 10.1 = 1279.0$. 

Hence the relative precision is $1279.0/329.8 = 3.88$. It will be seen that the 
corrections in the last column are here trivial, and could well be omitted. 
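
The computation of Method (c) can be sketched in Python as follows. The function name `mean_squares` is an assumption made for illustration; the inputs are the rounded entries of the first part of Table 8.3.c, so the results agree with the text only to within rounding. Strata from which no units were taken are given zero entries, as the table does for the first size-group.

```python
def mean_squares(h, n, ybar, s_sq):
    """Average within-strata and overall mean squares from stratum summaries.

    h    : proportions N_i/N of the population in each stratum
    n    : numbers of sampled units in each stratum
    ybar : stratum sample means
    s_sq : within-stratum sample variances
    """
    s1_sq = sum(hi * si for hi, si in zip(h, s_sq))
    y = sum(hi * yi for hi, yi in zip(h, ybar))          # estimated population mean
    between = sum(hi * yi ** 2 for hi, yi in zip(h, ybar)) - y ** 2
    # Correction for the sampling error of the stratum means; strata
    # with no sampled units contribute nothing.
    correction = sum(hi * (1 - hi) * si / ni
                     for hi, ni, si in zip(h, n, s_sq) if ni > 0)
    return s1_sq, s1_sq + between - correction


h = [0.174, 0.208, 0.143, 0.208, 0.160, 0.086, 0.020]
n = [0, 3, 6, 26, 40, 43, 17]
ybar = [0.0, 0.0, 4.5, 8.2, 29.1, 76.6, 172.1]
s_sq = [0.0, 0.0, 53.5, 159.2, 564.2, 1703.0, 2614.0]

s1_sq, s_overall = mean_squares(h, n, ybar, s_sq)
print(f"s1^2 = {s1_sq:.1f}, s^2 = {s_overall:.1f}, "
      f"relative precision = {s_overall / s1_sq:.2f}")
```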

Size-groups 301-500 and 501- can be combined in the manner shown in 
the second part of Table 8.3.c. We have, for these two size-groups combined, 

$s^2 = 1875 + 10,357 - 8959 - 30 = 3243$ 

We can now insert a fresh line in Table 8.3.c to replace the lines for the 
last two size-groups in the first part of the table. The previous computation 
is then repeated, giving 

$s_1^2 = 474.8$ 
$s^2 = 474.8 + 1102.0 - 290.0 - 9.1 = 1277.7$ 

Hence the relative precision is $1277.7/474.8 = 2.69$. 

The amalgamation of the two size-groups containing the largest farms 
has resulted in a considerable loss of precision, the relative precision being 
$329.8/474.8 = 0.69$. 

8.4 Multiple stratification 

The gain in precision due to sub-stratification of a sample which is already 
stratified into main strata can be estimated by methods similar to that of 
Section 8.3. An example has already been given in Example 8.3.c, where 
the gain in precision resulting from the subdivision of the size-group 301- 
into two groups, 301-500 and 501- was determined. 

If the data are derived from a sample with uniform sampling fraction 
which is itself sub-stratified, the comparisons can be made directly between 
the relevant mean squares in the analysis of variance, as in Method (a). The 
structure of the analysis of variance in this case is 

Whole sample ($s^2$) 
    Between main strata 
    Within main strata ($s_1^2$) 
        Between sub-strata 
        Within sub-strata ($s_2^2$) 

The ratio of the mean squares $s_1^2$ and $s_2^2$ within main strata and within 
sub-strata will give the required relative precision. 

A similar analysis can be constructed from data derived from other types 
of sample with uniform sampling fraction (Method (b)). 

One case of practical importance is that in which both the main and 
sub-strata are arbitrary subdivisions of an area, all the main and all the sub- 
strata being of equal size. If there are t' main strata, and t" sub-strata per 
main stratum, with k selected sampling units per sub-stratum, the analysis 
of variance will be of the form shown in Table 8.4. 

TABLE 8.4 STRUCTURE OF THE ANALYSIS OF VARIANCE IN A DOUBLE STRATIFICATION 

                                             Degrees            Mean 
                                             of freedom         square 
Between main strata                          $t' - 1$ 
Within main strata:   Between sub-strata     $t'(t'' - 1)$        B 
                      Within sub-strata      $t't''(k - 1)$ 
                      Total                  $t'(t''k - 1)$ 
Total for sample                             $t't''k - 1$ 



B = 



If k is small the bias in the estimate $s_1^2$ provided by the within-strata mean 
square may be appreciable. This bias can be almost completely eliminated 
by using the formula 

*! = *" A -1)JS 



which is derived directly from formula 8.3. 

8.5 Stratified sample with variable sampling fraction 

In the notation of Section 8.3, we have 

$$N V(\bar{y}) = \sum s_i^2 h_i (1 - f_i)/f_i$$

The $n_i$ or $f_i$ required for a given accuracy can only be determined uniquely 
from this equation if the relations between the different $f_i$ have been decided. 
It has already been pointed out (Section 3.5) that for maximum accuracy the 
$f_i$ should be proportional to $\sigma_i$, but that in many types of material stratified 
by size-groups the $f_i$ may be taken proportional to the mean sizes of the 
size-groups. If we put $f_i = c \lambda_i$, where the $\lambda_i$ are in the required proportions, 
the above equation can be written 

$$\frac{1}{c} \sum (s_i^2 h_i/\lambda_i) = N V(\bar{y}) + \sum s_i^2 h_i$$

The value of c for any required accuracy can then be calculated. If, however, 
a value of c is obtained which makes some of the $f_i$ greater than 1 the calculation 
must be repeated, omitting the terms for these strata from both sides of the 
equation. 

Alternatively the direct expression for V (y) can be used and the value of c 
found by trial. This has the advantage that the effect of adjustments of the 
final sampling fractions to simple fractions is immediately apparent. 

The relative precision of stratified samples with variable and with uniform 
sampling fractions can be obtained by calculating V (y) for both samples. 
It should be noted that if the $f_i$ have been taken proportional to the $s_i$ a slight 
over-estimate of the relative precision will be obtained, owing to errors in the 
$s_i$. This point has been discussed by Sukhatme (1935, A), but is not of great 
importance in practice. 

It will be seen that for these calculations we only require sufficiently accurate 
estimates of the variances within strata and the proportions of units of the 
population in the different strata. The procedure is therefore the same whether 
or not the sample from which the data are obtained is stratified. All that is 
required is that all strata should be adequately represented.* 

Example 8.5.a 

From the data of Table 8.3.c determine the size of sample required to give 
a standard error of 1500 acres in the estimate of wheat acreage, when 
sampling fractions proportional to those of Table 3.7.a are used. 

The $\lambda_i$ can be taken equal to the sampling fractions of Table 3.7.a. 
Tabulating $s_i^2 h_i$ and $s_i^2 h_i/\lambda_i$, we find 

$\sum (s_i^2 h_i/\lambda_i) = 2913$, $\sum s_i^2 h_i = 329.77$ 

Also $N V(\bar{y}) = V(Y)/N = 1500^2/2496 = 901.44$, and hence 
$c = 2913/(901.44 + 329.77) = 2.37$ 

The total number required in the sample is therefore $135 \times 2.37 = 320$, 
the number in the largest size-group, for example, being $51 \times 2.37/3 = 40$. 
No sampling fraction is greater than 1, and therefore no further computation 
is required. 
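
The calculation of c in this example may be sketched in Python. The variable names are chosen here for illustration only; the two sums, the population number 2496 and the original sample number 135 are the figures quoted in the example, and c is rounded to two decimals before the sample size is formed, as in the text.

```python
# Quantities quoted in Example 8.5.a
sum_s2h_over_lambda = 2913.0   # sum of s_i^2 h_i / lambda_i
sum_s2h = 329.77               # sum of s_i^2 h_i
population_size = 2496         # N, number of farms in the population
target_se_total = 1500.0       # required standard error of the total (acres)
original_sample = 135          # number of farms when f_i = lambda_i (c = 1)

# N V(ybar) = V(Y)/N for the required accuracy
n_var_mean = target_se_total ** 2 / population_size

# Solve (1/c) * sum(s_i^2 h_i / lambda_i) = N V(ybar) + sum(s_i^2 h_i) for c
c = sum_s2h_over_lambda / (n_var_mean + sum_s2h)
c_rounded = round(c, 2)

n_total = round(original_sample * c_rounded)
print(f"N V(ybar) = {n_var_mean:.2f}, c = {c_rounded}, sample size = {n_total}")
```

If any $c \lambda_i$ exceeded 1, the corresponding strata would have to be removed from both sums and c recomputed, as the text describes; the sketch does not attempt this step.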

In practice the new sampling fractions may well be rounded off, taking, 
for example, all of the largest size-group, ^ of the next, etc. 

* The case in which domains cut across strata is discussed in Section 9.5. 

The direct approach illustrates the way in which results of this kind can 
readily be obtained by interpolation. A first approximation to the number 
required is given by $135 \times 2550^2/1500^2 = 390$. The standard error 
corresponding to this sample number, obtained by the ordinary methods, is 
1302. The squares of the reciprocals of this and of the original standard 
errors can be plotted against the respective numbers in the samples and a 
smooth curve drawn through these two points and the origin. This curve 
gives the general relation between sample size and accuracy, and will be found 
to give a sample number corresponding to a standard error of 1500 of 
approximately 320. 
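
The graphical interpolation can be imitated numerically. In the sketch below a quadratic through the origin is assumed to stand in for the freehand smooth curve; this choice, and the function name `interpolate_sample_size`, are assumptions made for illustration and are not part of the procedure described above.

```python
def interpolate_sample_size(n1, se1, n2, se2, target_se):
    """Sample number giving target_se, from two (number, standard error) pairs.

    Fits y = a*n + b*n**2 through the origin and the two points, where
    y = 1/se**2, and solves for the n at which y = 1/target_se**2.
    """
    y1, y2, yt = se1 ** -2, se2 ** -2, target_se ** -2
    b = (y2 / n2 - y1 / n1) / (n2 - n1)
    a = y1 / n1 - b * n1
    if b == 0:                      # the two points lie on a straight line
        return yt / a
    disc = a * a + 4 * b * yt
    return (-a + disc ** 0.5) / (2 * b)


n_required = interpolate_sample_size(135, 2550.0, 390, 1302.0, 1500.0)
print(f"interpolated sample number: {n_required:.0f}")
```

With the figures of the example the quadratic gives a number a little below the 320 read from the curve, which is as close an agreement as the method warrants.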

Example 8.5.b 

Determine the relative precision of the sample of Table 3.7.a and the 
sample with sampling fractions proportional to $s_i$ containing the same number 
of farms. 

The standard error, using these sampling fractions, can be calculated in 
the ordinary manner, and is found to be 2420. The relative precision is 
therefore $2420^2/2550^2 = 0.90$. There is consequently an apparent loss of 
precision of approximately 10 per cent., but the real loss is likely to be less 
than this, owing to errors in the estimates of the standard errors. 

This apparent loss refers to a single variate, acreage of wheat. If, for 
instance, the acreage of some other crop were taken, the $\sigma_i$ would be different 
and the sampling fractions required to give minimum variance would therefore 
also be different. Consequently, if several variates have to be determined, 
a compromise will in any case be required. 

8.6 Supplementary information 

The determination of the number of units required in a sample when 
supplementary information is available presents no essentially new problems. 
It has been shown in Chapter 7 that apart from the substitution of sf or ^ 2 
for $s^2$ the formulae for the variances of estimates based on supplementary 
information differ little from those for estimates from similar samples without 
supplementary information. Consequently it will usually be sufficient to 
estimate the appropriate variance by the methods given in Chapter 7, using 
this variance instead of the ordinary variance per unit to determine the size 
of sample. The factor in the variance of the ratio estimate formed by the 
ratio of the squares of the two means of x differs from unity only because of 
sampling fluctuations in $\bar{x}$, and can be omitted. 

When the ratio method is to be used and $V(r)$ is virtually constant for all x, 
it will often be advantageous to estimate this variance rather than $s_q^2$. This 
will generally lead to somewhat simpler and more straightforward computations. 
Any slight bias introduced into the estimates of error will be of little consequence, 
since it will merely result in a slightly larger or smaller sample being taken. 

We frequently require an estimate of the gain in precision due to the use 
of supplementary information. This is needed in planning a sample survey 
when a decision has to be reached whether supplementary observations should 
be taken. It is also required in the planning of the computations in order to 
decide whether the utilization of available supplementary information is worth 
the additional computational labour. 

In the case of the regression method the relative precision is very simply 
calculated, since it depends only on the value of the correlation coefficient r, 
being in fact 

$$\frac{1}{1 - r^2}$$

In calculating r due regard must be had to any restrictions imposed by 
stratification, the same sums of squares and products being used as in the 
calculation of the regression coefficient and the residual error. The above 
expression is approximate in that the reduction by 1 of the error degrees of 
freedom with the regression has been ignored, but this correction will be small 
relative to errors in the estimation of r. 

If an arbitrary value $b'$ of the regression coefficient is used in place of the 
estimated value b, the relative precision will be 

$$\frac{1}{1 - r^2 + r^2 (1 - b'/b)^2}$$

The corresponding expression for the ratio method is obtained by writing 
$\bar{r}$ for $b'$. 

Example 8.6.a 

From the data of Example 7.17 calculate (a) the number of farms required 
to give an unbiased estimate of the mean dressing of nitrogen per acre over 
the farms of the county with a standard error of 0.05 cwt., and (b) the 
number of farms required in each of two equal groups so that the comparison 
based on the unweighted means of the dressings per acre of the two groups 
has a standard error of 0.05 cwt. 

(a) The required number is $67 \times 0.0392^2/0.05^2 = 41$. The correction for 
finite sampling is trivial in this example. Note that either $s_q^2$ or $s_r^2$ can be 
used to arrive at this result. 

(b) If the required number in each group is n, the variance of the difference 
of the means is $2 s_r^2/n$. Hence $n = 2 \times 0.0541/0.05^2 = 43$. 
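
Both parts of this example rest on the variance being inversely proportional to the number of units, the finite sampling correction being neglected. A short Python sketch, with a function name chosen here for illustration, makes the scaling explicit.

```python
def required_number(n_observed, se_observed, se_required):
    """Number of units giving se_required, when variance is proportional to 1/n."""
    return n_observed * se_observed ** 2 / se_required ** 2


# (a) 67 farms gave a standard error of 0.0392 cwt.; 0.05 cwt. is required
n_a = required_number(67, 0.0392, 0.05)

# (b) difference of two unweighted means: variance 2 * (variance per unit) / n,
#     with the variance per unit 0.0541 quoted in the example
n_b = 2 * 0.0541 / 0.05 ** 2

print(f"(a) {n_a:.1f} farms, (b) {n_b:.1f} farms per group")
```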

Example 8.6.b 

Obtain the expressions for the relative efficiencies given in Example 7.12.b 
from the above formulae. 

We have $b = 0.6327$, $r = 52,069/\sqrt{114,266 \times 82,296} = 0.537$ and 
$\bar{r} = 147.36/132.08 = 1.116$. Hence the relative precision of the regression 
method, compared with the sample plots only, is $1/(1 - 0.537^2) = 1.40$. 
Similarly the use of differences ($b' = 1$) gives a relative precision of 
$1/\{1 - 0.537^2 + 0.537^2 (1 - 1/0.6327)^2\} = 1.23$, the ratio method ($b' = 1.116$) 
gives a value of 1.14, and the regression ($b' = 0.55$) gives a value of 1.39. These 
correspond to the relative efficiencies already tabulated except in the case 
of the regression, for which we have here neglected the correction for degrees 
of freedom. 
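
The four values of this example can be reproduced from the two expressions of this section. In the sketch below the function name and the argument `b_used` (the arbitrary coefficient actually employed) are labels assumed for illustration; the last decimal may differ from the printed values because of rounding in r.

```python
def relative_precision(r, b, b_used=None):
    """Precision relative to direct estimation without supplementary information.

    r      : correlation coefficient
    b      : estimated regression coefficient
    b_used : coefficient actually used (1 for differences, the ratio for the
             ratio method); None means the estimated regression itself
    """
    if b_used is None:
        b_used = b
    return 1.0 / (1.0 - r * r + r * r * (1.0 - b_used / b) ** 2)


r, b = 0.537, 0.6327
results = {
    "regression": relative_precision(r, b),
    "differences (b' = 1)": relative_precision(r, b, 1.0),
    "ratio (b' = 1.116)": relative_precision(r, b, 1.116),
    "regression (b' = 0.55)": relative_precision(r, b, 0.55),
}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```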

Example 8.6.c 

Determine the gains in precision in the estimation of wheat acreages from 
the random sample of Hertfordshire farms due to the use of supplementary 
information on acreages of crops and grass, (a) using the ratio method, and 
(b) using the regression method, without taking account of districts. 

The standard errors, already obtained, are 7950 for direct estimation 
without the use of supplementary information (Example 7.2.b), 3940 for 
the ratio method (Example 7.8.a), and 4126 for the regression method 
(Example 7.12.a). The apparent gain in precision due to the ratio method 
is therefore $7950^2/3940^2 = 4.07$, and that due to the regression method is 
$7950^2/4126^2 = 3.71$. 

The value for the regression appears anomalous, since the formulae given 
above indicate that regression may be expected to be at least as efficient (apart 
from the change in degrees of freedom) as the ratio method. The discrepancy 
is due to the inclusion, in the variance of the ratio estimate, of the factor formed 
by the ratio of the squares of the two means of x. 
Using the above formulae with $r = 0.8555$, $b = 0.1932$, $\bar{r} = 0.1522$, we find 
that the relative precision, compared with direct estimation, is 3.73 for the 
regression method, and 3.32 for the ratio method. An alternative estimate of 
the relative precision of the regression and the ratio methods is therefore 
$3.73/3.32 = 1.12$. This latter value gives a better indication of the average 
value of the relative precision of the two methods. 

8.7 Two-phase sampling 

The only case which presents any new features is that in which the first- 
phase information is used as supplementary information to improve the 
accuracy of estimates of the second-phase variate y. It has already been pointed 
out in Section 7.8 that the variance of a two-phase sample is in this case made 
up of two parts A and B, where 

A = variance due to the first-phase sampling, i.e. the variance which would 
be obtained if y were determined for all the units of the first-phase 
sample, 

B = variance due to the second-phase sampling of the first-phase sample 
(regarded as without error). 

To determine B the methods given in Chapter 7 for supplementary 
information are followed, the effective sampling fraction being $n_2/n_1$. To 
determine A we must use the methods given in the present chapter for the 
evaluation of the error of a sample of one size and type from the data of a 
sample of a different size and possibly different type. Thus, if the first-phase 
sampling is random, and the second-phase sampling is stratified with a variable 
sampling fraction, it is necessary to calculate the variance of an unstratified 
random sample of $n_1$ units from the data of a stratified sample with variable 
sampling fraction of $n_2$ units. 

Once A and B have been determined the calculation of the relative precision 
of different possible sampling methods presents no difficulty. If, for example, 
we wish to ascertain the increase in precision due to taking a two-phase sample 
of $n_1$ and $n_2$ units instead of a single-phase sample of $n_2$ units, we calculate 
what the variance A' of a sample of $n_2$ units would be if the first-phase sampling 
procedure were followed for a sample of $n_2$ units. This calculation will follow 
the same lines as that of A. The relative precision is then A'/(A + B). 
Similarly the relative precision resulting from the ascertainment of the 
second-phase information on the $n_2$ second-phase units only, instead of on all the $n_1$ 
units of the sample, will be A/(A + B). 

In the simple but general case in which the population is large, and the 
methods of sampling and estimation are such that the variances of the estimates 
at each phase are inversely proportional to the numbers of units, apart from 
the factor $1 - n_2/n_1$, the above relative precisions are capable of simple 
expression. If the effective variances per unit are $s_1^2$ and $s_2^2$, with $s_2^2/s_1^2 = \kappa^2$ 
and $n_2/n_1 = \lambda$, we have 

$$A = s_1^2/n_1, \qquad B = (1 - \lambda) s_2^2/n_2, \qquad A' = s_1^2/n_2$$

Consequently the relative precision giving the gain due to the inclusion of the 
additional first-phase units is 

$$\frac{A'}{A + B} = \frac{1}{(1 - \lambda) \kappa^2 + \lambda}$$

Similarly the loss by not ascertaining the second-phase information over all 
the first-phase units is given by the relative precision 

$$\frac{A}{A + B} = \frac{\lambda}{(1 - \lambda) \kappa^2 + \lambda}$$

Representative values of these fractions are given in Table 8.7. If, for 
example, the effective standard error per unit is halved by the use of the 
first-phase supplementary information, $\kappa^2 = 1/4$. Consequently, if we introduce 
two-phase sampling and quadruple the size of the sample for first-phase 
information only, instead of using single-phase sampling, the amount of 
information derived from a second-phase unit is increased by a factor of 2.29. 
Similarly by collecting second-phase information on only 1/4 of the first-phase 
units instead of all the units the amount of information is reduced by a factor 
of 0.57. 
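
The two fractions are easily tabulated. The following Python sketch assumes the expressions $1/\{(1 - \lambda)\kappa^2 + \lambda\}$ for the gain and $\lambda/\{(1 - \lambda)\kappa^2 + \lambda\}$ for the loss; the function names are chosen here for illustration.

```python
def gain(lam, kappa_sq):
    """Two-phase sample (n1, n2) relative to a single-phase sample of n2 units."""
    return 1.0 / ((1.0 - lam) * kappa_sq + lam)


def loss(lam, kappa_sq):
    """Two-phase sample (n1, n2) relative to a single-phase sample of n1 units."""
    return lam * gain(lam, kappa_sq)


values = (1 / 2, 1 / 4, 1 / 8)
print("lambda  kappa^2   gain   loss")
for lam in values:
    for kappa_sq in values:
        print(f"{lam:6.3f}  {kappa_sq:7.3f}  {gain(lam, kappa_sq):5.2f}  "
              f"{loss(lam, kappa_sq):5.2f}")
```

The case worked in the text corresponds to `gain(0.25, 0.25)` and `loss(0.25, 0.25)`.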

TABLE 8.7 RELATIVE PRECISION OF TWO-PHASE AND SINGLE-PHASE SAMPLING 

Two-phase sample:         $n_1$ and $n_2$              $n_1$ and $n_2$ 
Single-phase sample:          $n_2$                      $n_1$ 

$\kappa^2$                1/2     1/4     1/8       1/2     1/4     1/8 

$\lambda = 1/2$           1.33    1.60    1.78      0.67    0.80    0.89 
$\lambda = 1/4$           1.60    2.29    2.91      0.40    0.57    0.73 
$\lambda = 1/8$           1.78    2.91    4.27      0.22    0.36    0.53 



8.8 Sampling on successive occasions 

The relative efficiency of the various estimates can be calculated from the 
variances given in Section 7.19. When the variances on the different occasions 
are the same, the relative efficiency of the various estimates, under the conditions 
set out in Section 6.22, depends only on $\mu$ and the correlation r between the 
successive occasions. 

Table 8.8.a gives the efficiencies, relative to those of the overall mean, 
of the adjusted estimates of the mean on the last occasion (a) when there is 
a sub-sample on the second occasion, and (b) with partial replacement, the 
latter being given for both two and a large number of occasions. Values for 
$\mu = 1/2$ and $\mu = 1/3$, and for various values of r, are given. With independent 
samples or a fixed sample the overall means are fully efficient. 

TABLE 8.8.a SAMPLING ON SUCCESSIVE OCCASIONS: EFFICIENCY, RELATIVE TO THE OVERALL MEAN, OF THE ADJUSTED ESTIMATES OF THE MEAN ON THE LAST OCCASION 

                  $\mu = 1/2$                           $\mu = 1/3$ 
          Sub-      Partial replacement         Sub-      Partial replacement 
  r       sample    Two         Large           sample    Two         Large 
                    occasions   number                    occasions   number 

 0        1.00      1.00        1.00            1.00      1.00        1.00 
 0.25     1.03      1.02        1.02            1.02      1.02        1.03 
 0.5      1.14      1.07        1.08            1.09      1.06        1.07 
 0.6      1.22      1.11        1.12            1.14      1.09        1.11 
 0.7      1.32      1.16        1.20            1.20      1.13        1.18 
 0.8      1.47      1.24        1.33            1.27      1.18        1.30 
 0.9      1.68      1.34        1.65            1.37      1.25        1.59 
 0.95     1.82      1.41        2.10            1.43      1.29        2.02 
 1.0      2.00      1.50        Inf.            1.50      1.33        Inf. 

TABLE 8.8.b SAMPLING ON SUCCESSIVE OCCASIONS: EFFICIENCY, RELATIVE TO THE DIFFERENCE OF THE OVERALL MEANS, OR TO INDEPENDENT SAMPLES (VALUES IN BRACKETS), OF ALTERNATIVE ESTIMATES OF CHANGE 

                  $\mu = 1/2$                        $\mu = 1/3$ 
  r       $y_h - y_{h-1}$   From last         $y_h - y_{h-1}$   From last         Fixed 
                         two occasions                      two occasions     sample 

 0        1.00 (1.00)    1.00 (1.00)        1.00 (1.00)    1.00 (1.00)       (1.00) 
 0.25     1.02 (1.16)    1.02 (1.17)        1.02 (1.22)    1.02 (1.22)       (1.33) 
 0.5      1.10 (1.47)    1.12 (1.50)        1.09 (1.63)    1.11 (1.67)       (2.00) 
 0.6      1.18 (1.69)    1.22 (1.75)        1.15 (1.92)    1.20 (2.00)       (2.50) 
 0.7      1.32 (2.03)    1.41 (2.17)        1.27 (2.37)    1.36 (2.56)       (3.33) 
 0.8      1.60 (2.67)    1.80 (3.00)        1.50 (3.22)    1.71 (3.67)       (5.00) 
 0.9      2.43 (4.41)    3.02 (5.50)        2.21 (5.53)    2.80 (7.00)      (10.00) 
 0.95     3.99 (7.61)    5.51 (10.50)       3.58 (9.76)    5.01 (13.67)     (20.00) 



The increase in precision due to the use of partial replacement instead 
of independent samples or a fixed sample can also be obtained from Table 8.8.a. 
Thus with a correlation of 0.8 replacement of half the units gives a 24 per cent. 
increase in precision on the second occasion and a 33 per cent. increase after 
a number of occasions. With one-third replacement the corresponding 
percentages are 18 and 30. 

Table 8.8.b gives similar efficiencies, relative to the differences of the 
overall means, or to independent samples (values in brackets), of the estimates 
of change given by $y_h - y_{h-1}$ and by the weighted estimate based on the last 
two occasions only (formula 6.21.b). 

In the estimation of change the difference between the overall means of 
two independent samples is less accurate than the difference of the overall 
means of a sample with partial replacement. This in its turn is less accurate 
than the difference between the means of a fixed sample. Thus with a 
correlation of 0.8 the weighted estimate from the last two occasions, with 
replacement of half the units, is 3.00 times as efficient as the difference of the 
means of two independent samples, but only 1.80 times as efficient as the 
difference of the overall means of the replacement sample. A repeated sample 
under these circumstances is 5.00 times as precise as a pair of independent 
samples. 
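
Two of the comparisons quoted above can be checked from simple expressions. The sketch below assumes equal variances on the two occasions and uses the standard results that the difference of the means of a fixed sample has variance proportional to $1 - r$, and the difference of the overall means of a sample with a fraction $1 - \mu$ of units retained has variance proportional to $1 - (1 - \mu) r$. These expressions and the function names are assumptions introduced here, not formulae quoted from Section 7.19, but they agree with the bracketed entries of Table 8.8.b.

```python
def fixed_vs_independent(r):
    """Efficiency of a fixed sample relative to independent samples (change)."""
    return 1.0 / (1.0 - r)


def overall_means_vs_independent(r, mu):
    """Efficiency of the difference of overall means, a fraction mu of the
    units being replaced, relative to independent samples."""
    return 1.0 / (1.0 - (1.0 - mu) * r)


r = 0.8
print(f"fixed sample: {fixed_vs_independent(r):.2f}")
print(f"overall means, half replaced: {overall_means_vs_independent(r, 0.5):.2f}")
# The tabulated 3.00 (relative to independent samples) and 1.80 (relative to
# the difference of overall means) should stand in the same ratio.
print(f"ratio of tabulated values: {3.00 / 1.80:.2f}")
```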

It will be noted that the estimate of change derived from the last two 
occasions is always somewhat more accurate than the estimate $y_h - y_{h-1}$. 
With a correlation of 0.8, for instance, there is a gain in efficiency of 12 per cent. 
when $\mu = 1/2$ and of 14 per cent. when $\mu = 1/3$. 

Example 8.8 

Estimate the relative efficiency of the various estimates of Examples 6.21 
and 6.22. 

In Example 6.21, $r = 0.847$ and $\mu = 1/3$. Consequently, from Table 8.8.a, 
the relative efficiency of $y_w$ and $\bar{y}$ is 1.21. From Table 8.8.b the relative 
efficiency of the weighted estimate of change and the difference of the overall 
means is about 2.1. The relative efficiency of the difference of the means of 
the units common to both occasions and the weighted estimate is given by the 
weight of the former, namely, 0.929. 

The relative efficiency of the estimates of Example 6.22 cannot easily be 
determined exactly, owing to the variation in the numbers of units from occasion 
to occasion. With the average value of $\mu$ of 1/3 and a correlation of 0.811, the 
efficiency of $y_h$ relative to the overall mean after a number of occasions will 
be 1.32 (Table 8.8.a) and that of the estimate $y_h - y_{h-1}$ of change relative 
to the difference of the overall means will be about 1.6 (Table 8.8.b). 

8.9 Sampling with probability proportional to size of unit 

The relative precision of sampling with uniform probability and with 
probability proportional to size of unit depends on the variance laws to which 
the material is subject. The case in which the mean r for fixed x is the same 
for all values of x, and in which the variance of r for fixed x is a function of x, 
may first be considered. 

If the total size of all units is known, we shall be concerned with estimates 
of $\bar{r}$. If we put $V(x)/\bar{x}^2 = \gamma$ we have the results shown in Table 8.9.a for 
the three variance laws there given, v being a constant. 

TABLE 8.9.a VARIANCES OF $\bar{r}$ 

Variance of r             Variance of $\bar{r}$ 
for fixed x        Uniform                   Probability 
                   probability               proportional to x 

$v$                $v(1 + \gamma)/n$           $v/n$ 
$v/x$              $v/(n\bar{x})$               $v/(n\bar{x})$ 
$v/x^2$            $v/(n\bar{x}^2)$             $v(1 + \gamma)/(n\bar{x}^2)$ 



In sampling for yield per acre in a crop estimation scheme, for example, 
the variance of the yield per acre may be expected to be about the same for 
large and small fields. If in addition there is no marked difference between 
the mean yields per acre of small and large fields, the precision of sampling 
with probability proportional to size relative to sampling with uniform 
probability will be $1 + \gamma$. 

If the mean r for fixed x varies with x the variances of Table 8.9.a will 
be increased, and the precision of either method, or the relative precision of 
the two methods, may best be judged by direct analysis of actual data. 

If the acreage of the crop has to be determined by the sampling of fields, 
the relative precision of sampling with probability proportional to size, and 
with uniform probability, will also depend on the variance of the acreages. 
The simplest case is that in which the sampling is used to determine which of 
the fields carry the given crop, and in which the values of $\bar{x}$ and $V(x)$ are the 
same for the fields of the given crop and for the remaining fields, the number 
of fields being large. The variance of the proportion p of the total area under 
the given crop when $n'$ fields are taken is in this case $pq/n'$ with sampling with 
probability proportional to size, and $pq(1 + \gamma)/n'$ with uniform probability. 
The relative precision is therefore $1 + \gamma$. 

In the case of sampling with probability proportional to size, point 
sampling will often be used. If the part of the land area which consists of 
fields cannot be recognized on the map, additional points will have to be visited 
on the ground, and these must be allowed for in assessing the total number of 
points required. 

In the more complicated cases of sampling with probability proportional to 
size the same general approach as that adopted in the previous sections must 
be followed, using the data provided by an actual sample to determine the 
relevant variances. If the basic data are derived from a sample taken with 
probability proportional to size, $s_r^2$ can be calculated from the formulas of 
Sections 7.15 or 7.16. The value so obtained may then be used to deduce 
the size of sample required for a given accuracy. 

If the basic data are derived from a sample taken with uniform probability 
of selection, or if data relating to the whole population are available, the various 
sizes of unit will occur in proportions which are different from those of a sample 
taken with probability proportional to size of unit. Consequently a different 
formula is required for the calculation of $s_r^2$. The appropriate formula for a 
random sample is 



where $\bar{r} = S(y)/S(x)$. If the individual values of y and r are tabulated the 
second form of the expression is most convenient for computation. 

In the case of a stratified sample the expression within the square brackets 
must be evaluated for each stratum separately. If the number in each stratum 
is small and there is no great difference between the $\bar{x}_i$, the separate components 
can then be aggregated and divided by $(n - t)\bar{x}$. If there are considerable 
differences between the $\bar{x}_i$ it is best to calculate $s_r^2$ separately for each stratum, 
using the separate values in the calculation of V (Y). * 

* The comparative efficiency of sampling with probability proportional to size and 
stratifying by size with variable sampling fraction is discussed in Section 10.10. 

Example 8.9.a 

From the data of Table 6. 19. a construct a frequency distribution of the 
acreages of sugar-beet fields on old arable land in Norfolk, and hence calculate 
the relative precision of estimates of the mean yield per acre derived from a 
random sample of fields taken (a) with probability proportional to size and 
(b) with uniform probability, on the assumption that the variability of the 
yield per acre is the same for all sizes of field. 

In constructing the frequency distribution, account must be taken of the 
variable sampling fractions at the two stages of sampling. Since the raising 
factors at the first stage are nearly proportional to 7, 4, 2, the fields on the small, 
medium and large farms with a single field of sugar-beet must be counted 7, 
4 and 2 times respectively. Similarly a field occurring on a farm with 2 fields 
of sugar beet must be counted 14, 8 or 4 times, etc. 

This procedure gives the frequency distribution shown in Table 8.9.b. 

TABLE 8.9.b FREQUENCY DISTRIBUTION OF THE ACREAGES OF SUGAR-BEET FIELDS 

Acreage    Raised No.        Acreage    Raised No. 
           of fields                    of fields 

   2           92               12           4 
   3           43               13           8 
   4          117 
   5           48               20           8 
   6           54 
   7           63               24           6 
   8           42 
   9            4               29          12 
  10           60 
  11           24               48           2 

                                Total      587 



Following the method of Example 7.1.a for grouped data (the acreages
being taken as the working units), we find

$\bar{x} = 6.681$, $s^2 = V(x) = 30.47$, $\gamma = 30.47/6.681^2 = 0.681$
Consequently the relative precision of methods (a) and (b) is $1 + \gamma = 1.68$.
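
The grouped-data arithmetic of this example can be checked with a short sketch. The frequencies are those of Table 8.9.b; the function and variable names are illustrative only, and the relation "relative precision = 1 + (variance / mean squared)" is an assumption read off from the figures quoted above rather than a formula stated in this passage.

```python
# Acreage -> raised number of fields, as read from Table 8.9.b.
freq = {2: 92, 3: 43, 4: 117, 5: 48, 6: 54, 7: 63, 8: 42, 9: 4,
        10: 60, 11: 24, 12: 4, 13: 8, 20: 8, 24: 6, 29: 12, 48: 2}


def grouped_mean_variance(freq):
    """Mean and variance (divisor n - 1) of a grouped frequency distribution."""
    n = sum(freq.values())
    total = sum(x * f for x, f in freq.items())
    total_sq = sum(x * x * f for x, f in freq.items())
    mean = total / n
    variance = (total_sq - total * total / n) / (n - 1)
    return n, mean, variance


n, mean, variance = grouped_mean_variance(freq)
rel_variance = variance / mean ** 2        # s^2 divided by the squared mean
relative_precision = 1 + rel_variance      # assumed relation, see the note above

print(n, round(mean, 3), round(variance, 2), round(relative_precision, 2))
```

Carried to full precision the ratio of the variance to the squared mean comes out near 0.68, so the relative precision agrees with the quoted 1.68 to two decimals.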

Example 8.9.b 

From the data of the sample of Hertfordshire parishes taken with uniform 
probability (Sample A of Section 3.11) estimate the value of $s_r^2$ for a sample
of parishes, stratified by districts, taken with probability proportional to size.
Make a similar estimate from the data for all 91 combined parishes.


The data for the 91 combined parishes are shown in Table 8.9.c, the 
parishes selected for samples A and B being indicated in the table. 

TABLE 8.9.c ACREAGES OF CROPS AND GRASS (DIVIDED BY 10), AND OF WHEAT, 

IN THE 91 COMBINED HERTFORDSHIRE PARISHES 
Dist. C. & G. Wh. Dist. C. & G. Wh. Dist. C. & G. Wh. Dist. C. & G. Wh. 



249 316a 

335 1646 

664 652 

226 192 

256 272 

314 131 

248 26 



3 



283 


612 


247 


624 


205 


356a 


304 


7666 


220 


362 


344 


7016 


237 


567 


204 


5036 


209 


573 


336 


728a 


305 


901 


330 


515a 


344 


788 


226 


434 


220 


506 


345 


838 



264 3860 
208 366 
237 31 Ib 

227 319 
220 238a 
436 54 
214 327 
333 2286 
464 1074 
232 313 
210 98a 
201 407 

265 466 
634 1264 

228 276 

229 249a6 
293 6866 
276 651 
281 503 
273 604 



363 

220 
390 
251 
210 
217 
305 
230 
227 
337 
282 
443 
250 
289 
213 
242 
416 
340 



454 
907 
466 
426 
263 
779 
5586 
440a 
618 
710 
7756 
5180 
495^6 
262 
565^6 
8626 
537 
358 1085 
246 474 
393 776 
267 702 
259 410 
388 862 
258 424 



380 


491 


347 


8186 


363 


741 


405 


582 


337 


586 


371 


442 


294 


416a 


252 


2256 


307 


284a 


374 


7386 


305 


244 


486 


562 


204 


194 


257 


236a 


249 


309 


337 


294 


323 


246 


350 


390 


306 


2906 


380 


244 


272 


237 


384 


318a 


251 


116 


27,30444 


,676 



The parishes selected for samples A and B of Table 3.11.b are indicated by
the letters a and b respectively.

The values of x, y, and r for district 4 (sample A) are as follows:

        x         y         r
      363       958      2.6391
      227       440      1.9383
      250       518      2.0720
      289       495      1.7128
      242       565      2.3347
     ----      ----      ------
     1371      2976      2.1707

Thus we have $S(ry) - \bar{r}\,S(y) = 958 \times 2.6391 + \dots - 2976 \times 2.1707 = 161.3$.
The corresponding values for districts 2, 3 and 6 are 63.9, 116.8,
and 0.0, with a sum of 342.0. The sample mean of x for these four districts
is 266.4 and consequently $s_r^2 = 342.0/(10 \times 266.4) = 0.1284$, or in acreage
units 0.001284.

This value is considerably less than the value 0.002649 obtained in Example
7.16. Each estimate, however, is based on only 10 degrees of freedom, so
that the discrepancy is not exceptionally large. The corresponding value from
the data for all 91 combined parishes, calculated in the same manner, is
0.002222. This calculation is left as an exercise for the reader.
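
As a check on the district 4 arithmetic, the following sketch recomputes the quantity S(ry) minus r-bar times S(y) from the tabulated x and y and then forms the pooled estimate. The contributions of districts 2, 3 and 6, the 10 degrees of freedom and the mean of x are taken as quoted in the text; the function name is illustrative. Because r is computed here without intermediate rounding, the district 4 value differs slightly from the rounded figure in the text.

```python
# District 4, sample A: x = acreage of crops and grass (divided by 10),
# y = acreage of wheat.
x = [363, 227, 250, 289, 242]
y = [958, 440, 518, 495, 565]


def ratio_component(x, y):
    """Return S(ry) - r_bar * S(y), with r = y / x and r_bar = S(y) / S(x)."""
    r = [yi / xi for xi, yi in zip(x, y)]
    r_bar = sum(y) / sum(x)
    return sum(ri * yi for ri, yi in zip(r, y)) - r_bar * sum(y)


district4 = ratio_component(x, y)

# Values quoted in the text for districts 2, 3 and 6.
components = [63.9, 116.8, 0.0, district4]
degrees_of_freedom = 10     # as used in the text
mean_x = 266.4              # sample mean of x for the four districts

s_r2 = sum(components) / (degrees_of_freedom * mean_x)
print(round(district4, 1), round(s_r2, 4))
```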

Example 8.9.c 

Compare the relative precision, in the estimation of wheat acreage, of
samples of Hertfordshire parishes taken with uniform probability and with
probability proportional to size, by calculating the expected standard errors of
samples of types A and B of Table 3.11.b.

The data for all 91 combined parishes give a value of $s_q^2$ of 23,483 when
districts are eliminated and the same ratio is taken for all districts, and a value
of 22,427 when different ratios are taken for the different districts.

In calculating the expected standard error the formula of Section 7.10
may be used, so as to allow for the variation in sampling fraction from district
to district. The factors $X_t^2/\{S_t(x)\}^2$ may be replaced by $1/f_t^2$ since we are
considering the average error to be expected over a series of similar samples.
This will lead to a slight underestimation of the average error.

We find $S\{(1 - f_t)\,n_t/f_t^2\} = 402.83$, and consequently $V(Y) = 9.460 \times 10^6$
when the same ratio is taken for all districts, and $9.034 \times 10^6$ when different
ratios are taken.

Similarly, in the case of sampling with probability proportional to size,
from the results already given in Example 7.16, and the value of $s_r^2$ given in
Example 8.9.b, we find $V(Y) = 8.263 \times 10^6$.

The standard errors corresponding to these variances have already been
given in Table 3.11.b.

The relative precision of sampling with probability proportional to size,
and with uniform probability using a single value of the ratio, is therefore
9.460/8.263 = 1.14. There is thus a gain in precision of 14 per cent., but
it must be recognized that sampling with probability proportional to size will
result in parishes of larger average size being included in the sample. Neglecting
the disturbance due to the probability being only approximately proportional
to size, the average size of parish in this case will be given by $S(x^2)/S(x)$, where
the summations are taken over the whole population (or a sample selected with
uniform probability). This gives an average size of 3244 acres of crops and
grass, compared with the arithmetic mean of 3000 acres, i.e. an average size
greater by 8 per cent.
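
The figures of this example can be reproduced from the quantities quoted in it. The sketch below assumes, from the arithmetic, that each uniform-probability variance is the product of the residual variance (23,483 or 22,427) and the factor 402.83; the variable names are illustrative.

```python
residual_single = 23_483      # same ratio taken for all districts
residual_separate = 22_427    # different ratios for the different districts
factor = 402.83               # sum over districts quoted in the text

v_uniform_single = residual_single * factor
v_uniform_separate = residual_separate * factor
v_pps = 8.263e6               # quoted variance, probability proportional to size

relative_precision = v_uniform_single / v_pps

# Average size of parish under selection proportional to size, relative
# to the arithmetic mean (both figures quoted in the text, in acres).
size_ratio = 3244 / 3000

print(round(v_uniform_single / 1e6, 3),
      round(v_uniform_separate / 1e6, 3),
      round(relative_precision, 2),
      round(size_ratio, 2))
```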

8.10 Interpretation of the analysis of variance 

The analysis of variance can be interpreted in the manner set out below. 
This interpretation is of particular use when we are concerned with multi-stage 
sampling, and with the effect of change of size of the sampling units.

If the units fall into groups of any kind, such as strata, the unit values of
a variate y can be regarded as made up of the sum of two parts, one, u, which
varies from group to group but has a fixed value for all units of a particular
group, and the other, v, which varies from unit to unit independently of the
groups. The variances of u and v may be denoted by U and V respectively.
Thus u and v may be random sample values from normal distributions, though
the condition of normality is not necessary. In this hypothetical framework
zero mean can be assigned to the parent distribution of v without loss of
generality, but even so the mean of the v's for all the units of a finite population,
or for all the units of a particular group, will not be exactly zero, and consequently
the group means are not exactly equal to the u's. For this reason the values of
u and v cannot be uniquely determined from the values of y.

The mean squares of the analysis of variance provide estimates of U and V.
If A and B are the mean squares between and within groups, C is the overall
mean square, k is the number of units in each group, and h the number of
groups, we have

$$A = V + kU, \qquad B = V$$

Hence

$$U = (A - B)/k$$

We also have, from the analysis of variance, $(hk - 1)\,C = h(k - 1)\,B + (h - 1)\,A$.
Consequently if $s^2$ is the overall variance and $s_t^2$ the variance
within groups we have, from formula 8.3,

$$s^2 = U\,(h - 1)/h + V, \qquad s_t^2 = V$$

The factor $(h - 1)/h$ is analogous to the correction for sampling from a finite
population.

The relative precision of stratified and random sampling will be obtained
by taking the groups as strata. We then have, with t strata,

$$\frac{s^2}{s_t^2} = 1 + \frac{t-1}{t}\cdot\frac{U}{V}$$

An alternative formulation is possible in terms of the intra-class correlation,
i.e. the correlation between members of the same stratum when the strata
themselves are regarded as a random sample from an infinite set of similar
strata (R. A. Fisher, Statistical Methods for Research Workers, Section 40).
The estimate $r_i$ of this correlation is given by

$$r_i = \frac{A - B}{A + (k-1)\,B} = \frac{U}{U + V}$$

and consequently

$$\frac{s^2}{s_t^2} = 1 + \frac{t-1}{t}\cdot\frac{r_i}{1 - r_i}$$

Looked at from this point of view, the intra-class correlation coefficient 
may be regarded as a quantitative expression of association which is alternative
to the ratio U/V. In this book we shall use the concept of additive components
of variance, since this appears to be more easily capable of generalization, and
is otherwise preferable to the concept of intra-class correlation.

When there is compensation between the different units of the same stratum
the definition of U as a variance breaks down, and has to be extended (see
Yates and Zacopanay, 1935, H). Complete compensation occurs when all
the strata means (or the first-stage units in a two-stage scheme) are equal.
In this case $KU + V = 0$, i.e. $U = -V/K$, where K is the number of units
in each group of the population. Negative values of U between 0 and $-V/K$
are therefore admissible.
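
The relations of this section can be collected in a brief sketch. It assumes the mean-square equivalences A = V + kU and B = V (which give U = (A - B)/k), and uses purely hypothetical mean squares; the function names are illustrative.

```python
def variance_components(A, B, k):
    """Estimate U (between groups) and V (within groups) from the mean
    squares A (between groups) and B (within groups), k units per group."""
    V = B
    U = (A - B) / k
    return U, V


def intraclass_correlation(A, B, k):
    """Estimate of the intra-class correlation, (A - B) / (A + (k - 1) B)."""
    return (A - B) / (A + (k - 1) * B)


def precision_ratio(U, V, t):
    """Overall variance divided by within-strata variance, for t strata."""
    return 1 + (t - 1) / t * U / V


# Hypothetical mean squares, chosen only for illustration.
A, B, k, t = 50.0, 20.0, 5, 10
U, V = variance_components(A, B, k)
r_i = intraclass_correlation(A, B, k)

ratio_from_components = precision_ratio(U, V, t)
ratio_from_correlation = 1 + (t - 1) / t * r_i / (1 - r_i)
print(U, V, round(r_i, 4), round(ratio_from_components, 3))
```

The two ways of computing the ratio agree because U/V equals r_i/(1 - r_i) when r_i = U/(U + V).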

8.11 Multi-stage sampling 

The sampling variance of two-stage sampling can be divided into two parts, 
A and B, where 

A = variance due to the first-stage sampling when there is complete 
ascertainment at the second stage, i.e. when all the second-stage units 
which go to make up the selected first-stage units are known, 
B = variance due to the second-stage sampling of the selected first-stage 

units. 

Thus the formula of Section 7.17 for $V(\bar{y})$ in two-stage random sampling
may be rewritten

$$V(\bar{y}) = \frac{1-f'}{n'}\,s_A'^2 + \frac{1-f''}{n'\,n''}\,s''^2 \qquad (8.11.a)$$

where

$$s_A'^2 = s'^2 - \frac{1-f''}{n''}\,s''^2 \qquad (8.11.b)$$

The first term constitutes part A and the second part B.

The second term will be recognized as $(1 - f'')/(1 - f)$ times the variance
that would be obtained with single-stage sampling of the second-stage units,
the same total number of second-stage units being taken, with the first-stage
units as strata and uniform sampling fraction $f$. If $f''$ is small, therefore, the
first term gives the increase in variance due to the adoption of the two-stage
process.

The above subdivision is alternative to that given in Section 7.17. Part A 
is dependent only on the first-stage sampling, being unaffected by the intensity 
or type of sampling at the second stage. This fact considerably simplifies the 
problem of determining the sampling errors for different intensities of sampling 
at the two stages: with the subdivision of Section 7.17 the variation in $s'^2$
for different intensities of sampling at the second stage has to be taken into
account.

The only new point that arises in the estimation of the relevant variances is
the determination of part A from the data of a two-stage sample. In general
this simply requires that the variance per first-stage unit due to the second-stage
sampling of the first-stage units be deducted from the variance per first-stage
unit calculated from the sample. Thus for a two-stage random sample formula
8.11.b is used.

It is often helpful to carry out an analysis of variance on data derived from 
a two-stage sampling process. The situation is simplest when the number 
of sampled second-stage units $n''$ in each first-stage unit is the same. Each
stage of the analysis then follows the same pattern as the analysis of a
single-stage sample of the same type. At the first stage, however, the values entering
into the analysis must be either the means or the totals of the second-stage
unit values. It is customary (though not essential) to tabulate the sums of
squares of the first stage in terms of the second-stage units. If the first-stage
unit means are used, therefore, all sums of squares at the first stage must be
multiplied by $n''$, while with totals all sums of squares must be divided by $n''$.

In the case of two-stage random sampling, for example, the degrees of
freedom and mean squares will be

                                        Degrees
                                        of freedom        Mean square

Between first-stage units               $n' - 1$          $n''s'^2 = V + n''U$
Within first-stage units between
  second-stage units                    $n'(n'' - 1)$     $s''^2 = V$

TOTAL                                   $n'n'' - 1$

We then have

$$V(\bar{y}) = \frac{1-f'}{n'}\,U + \frac{1-f}{n}\,V \qquad (8.11.c)$$

where $n$ is the total number of second-stage units and $f$ is the overall sampling
fraction ($n = n'n''$ and $f = f'f''$). The second term of this subdivision is
the estimate of the variance that would be obtained with single-stage sampling
of the second-stage units, the same total number of sampling units being taken,
with first-stage units as strata, and uniform sampling fraction. The analysis
of variance therefore provides a further alternative subdivision of the sampling
variance.
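
A small numerical sketch may help to show how the subdivisions relate. It assumes the mean-square equivalences of the table above (between mean square = V + n''U, within mean square = V) and reads the part A and part B split as: part A uses the first-stage variance less the second-stage contribution (1 - f'')/n'' times the within variance. All input figures are hypothetical and the names are illustrative.

```python
def variance_8_11_c(ms_between, ms_within, n1, n2, f1, f2):
    """Sampling variance of the mean from the analysis of variance:
    (1 - f') / n' * U + (1 - f) / n * V."""
    V = ms_within
    U = (ms_between - ms_within) / n2
    n = n1 * n2
    f = f1 * f2
    return (1 - f1) / n1 * U + (1 - f) / n * V


def parts_a_and_b(ms_between, ms_within, n1, n2, f1, f2):
    """Split into part A (first stage) and part B (second stage)."""
    s1_sq = ms_between / n2                   # variance of first-stage unit means
    s2_sq = ms_within                         # variance within first-stage units
    s_a_sq = s1_sq - (1 - f2) / n2 * s2_sq    # second-stage contribution deducted
    part_a = (1 - f1) / n1 * s_a_sq
    part_b = (1 - f2) / (n1 * n2) * s2_sq
    return part_a, part_b


# Hypothetical values: 20 first-stage units with 5 second-stage units each.
args = dict(ms_between=40.0, ms_within=10.0, n1=20, n2=5, f1=0.1, f2=0.25)
total = variance_8_11_c(**args)
part_a, part_b = parts_a_and_b(**args)
print(round(total, 4), round(part_a, 4), round(part_b, 4))
```

Under these assumptions the two subdivisions give the same total, which is the sense in which they are alternative partitions of one sampling variance.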

The results are similar with stratification with uniform sampling fraction 
at either or both stages. 

When one or both the sampling fractions are variable, or when the numbers 
of second-stage units in the different first-stage units are unequal, the analysis 
of variance becomes more complicated and the direct approach is often simplest. 
With moderate inequality in the $n''$ the analysis of variance of the first-stage
units may be carried out on the means, with multiplication of the mean squares
by $\bar{n}''$, or better by the harmonic mean of the $n''$, i.e. the reciprocal of the mean
of the reciprocals.

Alternatively the whole analysis may be carried out in terms of the second- 
stage units. In this case both the means (in terms of the second-stage units) 
and totals of the first-stage units are tabulated, the sums of squares being obtained
by the "mean × total" rule, i.e. every mean is multiplied by the corresponding
total.

If the second method is used $n''$ can be replaced by $\bar{n}''$ in the expression
$n''s'^2 = V + n''U$ given above for the mean square for first-stage units of
a random sample, or better by $n_0''$ where

$$n_0'' = \{S(n'') - S(n''^2)/S(n'')\}/(n' - 1) \qquad (8.11.d)$$

With a stratified sample a value for $n_0''$ is calculated for each stratum and a
weighted mean taken, weighting by the degrees of freedom contributed to the
within-strata sum of squares (Cochran, 1939, A).
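
Formula 8.11.d and the harmonic mean mentioned earlier are easily computed. The sketch below uses a hypothetical set of numbers of second-stage units per first-stage unit; the function names are illustrative.

```python
def effective_group_size(sizes):
    """{S(n'') - S(n''^2) / S(n'')} / (n' - 1), following formula 8.11.d."""
    total = sum(sizes)
    total_sq = sum(m * m for m in sizes)
    return (total - total_sq / total) / (len(sizes) - 1)


def harmonic_mean(sizes):
    """Reciprocal of the mean of the reciprocals."""
    return len(sizes) / sum(1 / m for m in sizes)


sizes = [2, 2, 1, 3, 1, 1]          # hypothetical values of n''
arithmetic_mean = sum(sizes) / len(sizes)

print(round(arithmetic_mean, 3),
      round(effective_group_size(sizes), 3),
      round(harmonic_mean(sizes), 3))
```

When all the sizes are equal both quantities reduce to the common value; with unequal sizes both fall below the arithmetic mean.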

These alternative methods of analysis are not exactly equivalent, but we 
cannot discuss their differences here, beyond stating that the first method is 
generally best when all the first-stage units are of approximately the same 
size and the variation in the numbers of second-stage units per first-stage unit 
is due to extraneous causes, whereas the second method is likely to be preferable 
when the first-stage units vary greatly in size and the number of second-stage 
units per first-stage unit is about proportional to this size. 

The above methods can easily be extended to multi-stage sampling with 
more than two stages. 

Example 8.11.a

Calculate the expected sampling errors of the wheat acreages derived from
the two-stage sample $B_1$ of Hertfordshire farms of Table 3.11.b, and discuss
the effects of varying the number of parishes in the sample, with adjustment
of the second-stage sampling fraction so as to give the same total number
of farms in the sample.

Part A of the variance has already been determined in Example 8.9.c.
We have $A = 8.263 \times 10^6$.

The determination of part B requires the evaluation of the variance of the
r for individual parishes due to the second-stage sampling of these parishes.
These variances were evaluated separately for each of the 17 parishes of the
sample, using method (a) of Section 7.9. The mean value of these variances
$V''(r)$ was found to be 0.003575.

The equation of estimation of the total acreage is $Y = S(X_t\,\bar{r}_t)$. The
second-stage variance of $\bar{r}_t$ is $V''(r)/n_t$, and part B of the variance is therefore
given by

$$B = V''(r)\,S(X_t^2/n_t) = 0.003575 \times 45.429 \times 10^8 = 16.24 \times 10^6$$

Hence $V(Y) = 24.50 \times 10^6$.

Exact treatment of the effects of varying the number of parishes is complicated 
by the fact that the first-stage sampling fractions are bound to vary somewhat 
from district to district, and that the number of farms per parish is also variable.
Ignoring these sources of disturbance we may write for any number $n'$ of
parishes and $n''$ of farms per parish

$$V(Y) = \frac{1-f'}{n'}\,\alpha + \frac{1-f''}{n'\,n''}\,\beta$$

The values of $\alpha$ and $\beta$ can be determined from the values of A and B.
The mean number of farms per parish is $N'' = 2496/91 = 27.429$. Putting
$n' = 17$, $f' = 17/91$, $f'' = \tfrac14$, and $n'' = \tfrac14 \times 27.429 = 6.857$, we have

$$\alpha = \frac{17}{1 - 17/91} \times 8.263 \times 10^6 = 172.7 \times 10^6$$

$$\beta = \frac{17 \times 6.857}{1 - \tfrac14} \times 16.24 \times 10^6 = 2524.2 \times 10^6$$

The effect of any variation in $n'$ and $n''$ can now be determined from the
formula for V(Y). If the total number of farms $n$ ($= n'n''$) is to be kept fixed,
the formula is best rewritten in the form

$$V(Y) = (\alpha - \beta/N'')/n' + \beta/n - \alpha/N' = \{80.71/n' + 2524.2/n - 1.8982\} \times 10^6$$

with the checks that, when $n' = 91$ and $n = 2496$, V(Y) is zero, and when
$n' = 17$ and $n = 17 \times 6.857 = 116.57$, it equals the value given above.

The values of $V(Y)/10^6$ for 5, 10, 20, and 30 parishes and a number of
farms, 116.57, approximately the same as that of the actual sample, are 35.9,
27.8, 23.8 and 22.4 respectively. If all 91 parishes were sampled the
corresponding variance would be 20.6. There is thus no great gain in taking
more than 20 parishes when sampling within parishes is with a uniform 
sampling fraction. 
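
The tabulated variances can be regenerated from A and B alone. The following sketch follows the rewritten formula of this example, with the coefficients computed at full precision rather than from the rounded figures; the variable and function names are illustrative.

```python
A = 8.263e6              # part A of the variance (first-stage sampling)
B = 16.24e6              # part B of the variance (second-stage sampling)
N1 = 91                  # parishes in the population
N2 = 2496 / 91           # mean number of farms per parish
n1 = 17                  # parishes in the actual sample
f1 = n1 / N1             # first-stage sampling fraction
f2 = 0.25                # second-stage sampling fraction
n2 = f2 * N2             # farms sampled per parish

alpha = n1 / (1 - f1) * A
beta = n1 * n2 / (1 - f2) * B


def variance_total(n_parishes, n_farms):
    """V(Y) for a given number of parishes and a given total number of farms."""
    return (alpha - beta / N2) / n_parishes + beta / n_farms - alpha / N1


n_fixed = n1 * n2        # about 116.57 farms, as in the actual sample
for parishes in (5, 10, 20, 30, 91):
    print(parishes, round(variance_total(parishes, n_fixed) / 1e6, 1))
```

The two checks mentioned in the text correspond to `variance_total(91, 2496)`, which vanishes apart from rounding error, and `variance_total(17, n_fixed)`, which returns A + B.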

It should be noted that the use of the above formula when the fraction 
of parishes sampled is large is unrealistic, in that sampling with probability 
proportional to size could not be adopted in such cases. It serves, however, 
to illustrate the use of the similar formulas which could be developed for 
sampling with uniform probability at the first stage. 

Example 8.11.b

Repeat the analysis of Example 8.11.a for the sample $B_2$ of Table 3.11.b.

Part A of the total variance will be the same as in sample $B_1$.

Part B can be calculated in the same manner as in Example 8.11.a, using
the method of Section 7.10, with a separate value of the ratio for each parish.
This gives $V''(r) = 0.0008167$, and $B = 3.710 \times 10^6$. Hence $A + B = 11.97 \times 10^6$.

The expression of V(Y) in terms of $n'$ and $n''$ is complicated by the variable
sampling fraction at the second stage. As a first approximation we may take
an average sampling fraction $f''$ for the given sample, and use this to obtain
a formula of the same form as in Example 8.11.a. This gives $f'' = 0.29121$
and we then find

$$V(Y) = \{146.83/n' + 710.76/n - 1.8982\} \times 10^6$$

with checks as before.

This formula gives values of $V(Y)/10^6$ for 5, 10, 20 and 30 parishes and
135.8 farms of 32.7, 18.0, 10.7 and 8.2 respectively. With more accurate
sampling of the farms within the selected parishes, therefore, there is a more
marked decrease in variance as the number of parishes is increased. The above
values underestimate the decrease, as with the reduction in the number of farms
per parish the change in the second-stage sampling fractions will result in a
somewhat smaller increase in the variance than that given by the formula.
More accurate values could be obtained by recalculating the second-stage
variances with various intensities of second-stage sampling, using graphical
methods for interpolation between the calculated values.

Example 8.11.c

Investigate the relative precision of the determination of the acreages of 
crops by the measurement of the areas of fields and part fields included in a 
sample of rectangular areas, and the use of grids of points covering these areas 
(Section 4.24). 

If a sample area has an area $a$ and the proportion of the area occupied by
a given crop is $p$, the area $y$ occupied by this crop in this sample area equals $ap$.
If a random set of $n$ points is taken over the area the variance of the estimate of $y$
given by the proportion of points falling in the given crop is

$$V(y) = a^2\,pq/n, \qquad q = 1 - p$$

This variance will be additional to the sampling variance of the $y$ over
the sample areas. This latter variance depends on the sampling method and
the variability of the $y$ from area to area, and can only be determined from
actual sample data.

As an example we may consider the case in which the areas are randomly
selected, and the frequency distribution of the areas with proportions 0.0,
0.1, . . . of the given crop is as follows:

Proportion, p        0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
Frequency, $\varphi$        0.05  0.15  0.20  0.15  0.12  0.10  0.07  0.05  0.03  0.03  0.05

The average variance due to the point sampling is given by

$$S\{\varphi\,V(y)\} = (a^2/n)\,S(\varphi\,pq) = 0.1656\,a^2/n$$

The sampling variance of the $y$ is given by

$$V(y) = S(\varphi\,p^2a^2) - \{S(\varphi\,pa)\}^2 = 0.069024\,a^2$$

The proportional increase in variance due to point sampling is therefore

$$0.1656/(0.069024\,n) = 2.40/n$$

With 9 points there is an increase of 27 per cent. in the variance and with
16 points an increase of 15 per cent. In the latter case about one-seventh
more areas will be required for the same accuracy. Against this must be set
the fact that only fields in which the points fall need be examined and recorded.
The occurrence of mixed crops and systematic location of the points on a
rectangular grid will also reduce the sampling variance.
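
The figures of this example follow directly from the frequency distribution. The sketch below takes the area a as the unit and assumes, as the arithmetic of the example indicates, the binomial variance pq/n for the proportion of n random points falling in the crop; the names are illustrative.

```python
# Proportions 0.0, 0.1, ..., 1.0 and their relative frequencies.
p = [i / 10 for i in range(11)]
phi = [0.05, 0.15, 0.20, 0.15, 0.12, 0.10, 0.07, 0.05, 0.03, 0.03, 0.05]

# Average of p * q over the distribution (point-sampling variance times n).
point_term = sum(w * pi * (1 - pi) for w, pi in zip(phi, p))

# Variance of p from area to area (sampling variance of y with a = 1).
mean_p = sum(w * pi for w, pi in zip(phi, p))
area_term = sum(w * pi * pi for w, pi in zip(phi, p)) - mean_p ** 2


def proportional_increase(n_points):
    """Proportional increase in variance when n_points random points are used."""
    return point_term / (area_term * n_points)


print(round(point_term, 4), round(area_term, 6))
for n_points in (9, 16):
    print(n_points, round(100 * proportional_increase(n_points)), "per cent")
```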

8.12 An example of a pilot sampling scheme for crop estimation 

In order to investigate the practicability of obtaining estimates of the yields 
of cereal crops in the United Kingdom by the harvesting of sample areas, the 
yields of a number of wheat fields were determined by this method in each of 
the years 1934-1938 (Cochran, 1939, A). Fields were taken in several districts 
each year, one or two fields being selected at random from the fields growing 
wheat on each chosen farm. The selection of farms in each district was not 
random, the farms being taken in the neighbourhood of the centres at which 
the investigators were located. 

The sampling of the individual fields followed the lines described in
Section 4.29, the fields being traversed in the direction of the rows, along two
lines selected at random. Two sets of unit areas were taken from each line.
Each unit area consisted of ¼ metre of each of 6 contiguous rows. For the
most part, sets each contained three unit areas, equally spaced along the line,
with a random starting point.

The yields of grain obtained in 1937 are shown in Table 8.12.a. The
mean yield of all the unit areas in each set is given. In order to allow for
differences in row spacing on the different fields the yields have been reduced
to a 6-inch row spacing, and therefore represent the yields in grams of areas
of ¼ metre × 3 ft. Fields on the same farm are indicated by brackets. In
District III, where three fields from a single farm were sampled, each field
was growing two varieties which were sampled separately.

The analysis of variance was carried out in units of the totals of the four
sets, i.e. on yields of areas of 1 metre × 3 ft. or 0.000226 acres. Thus the
sum of squares of the sets is multiplied by 4, and the sum of squares of the
line totals by 2.

The sums of squares for 1937 can be obtained from Table 8.12.a by
calculating the sum of squares for each classification, disregarding the others,
and deducting the sum of squares corresponding to the next higher classification.
The rule of "mean × total" or "total²/(number of units)" is followed in
each case. Thus the correction for the mean is $11{,}706^2/39 = 3{,}513{,}601$. The
sum of squares for districts is

$$\tfrac14(1098)^2 + \tfrac13(909)^2 + \dots - 3{,}513{,}601 = 54{,}224$$

The sum of squares for farms is

$$\tfrac12(422)^2 + \tfrac12(676)^2 + \tfrac12(619)^2 + 290^2 + \tfrac16(2053)^2 + \dots - 3{,}513{,}601 - 54{,}224 = 132{,}062$$

The arithmetical work can be simplified by omitting items which are
repeated in more than one sum of squares. In particular the sum of squares
for varieties is

$$364^2 + 405^2 + 287^2 + \dots - \tfrac12(769)^2 - \tfrac12(597)^2 - \tfrac12(687)^2 = 1326$$

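TABLE 8.12.a SAMPLING OF WHEAT FIELDS, 1937: MEAN YIELDS OF GRAIN
PER UNIT AREA (0.0000565 ACRES) IN GRAMS

District I
  1st line   Set 1      47   48   75  105
             Set 2      63   51   71   82
  2nd line   Set 3      67   45   75   97
             Set 4      55   46   85   86
  Field totals         232  190  306  370
  Farm totals          422 (232 + 190), 676 (306 + 370)
  District total       1098

District II
  1st line   Set 1      93   58   76
             Set 2      84   78   57
  2nd line   Set 3      75   68   79
             Set 4      80   83   78
  Field totals         332  287  290
  Farm totals          619 (332 + 287), 290
  District total       909

District III (one farm; three fields, each with two varieties)
  1st line   Set 1      92   89   89   75   70   80
             Set 2      83  111   58   72   85   97
  2nd line   Set 3      93   90   70   82   76  111
             Set 4      96  115   70   81  102   66
  Variety totals       364  405  287  310  333  354
  Field totals         769 (364 + 405), 597 (287 + 310), 687 (333 + 354)
  District total       2053

District IV
  1st line   Set 1      29   45   57   69   78   59   68   97   60   65   81
             Set 2      21   39   63   55  109   59   56   88   53   59   94
  2nd line   Set 3      29   69   46   21   90   58   74  109   43   49   93
             Set 4      31   57   66   40   51   53   61   95   48   71   92
  Field totals         110  210  232  185  328  229  259  389  204  244  360
  Farm totals          320 (110 + 210), 417 (232 + 185); other fields one per farm
  District total       2750

District V
  1st line   Set 1      77   60
             Set 2      74   88
  2nd line   Set 3      44   84
             Set 4      57   97
  Field totals         252  329
  Farm total           581
  District total       581

District VI
  1st line   Set 1      66   93   55  127   84   80   81   93   21   84   87   79   90
             Set 2      73   70   56  106   80   86  107  106   63   51   67   79  117
  2nd line   Set 3      64   80   83   84   63   88  135   71   50   82  135   71  112
             Set 4      73   67   60   98   89  110   82   83   29   80  114   89  122
  Field totals         276  310  254  415  316  364  405  353  163  297  403  318  441
  Farm totals          586 (276 + 310), 669 (254 + 415), 680 (316 + 364); other fields one per farm
  District total       4315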


Furthermore the sum of squares corresponding to the difference between
any two totals containing the same number of units can be obtained by squaring
the difference and dividing by twice the number of units in either total. Thus
the sum of squares due to varieties is also given by $\tfrac12(41^2 + 23^2 + 21^2)$.
This gives a useful check in cases in which, as here, many of the sums of
squares depend on differences of pairs of values. Thus the sum of squares
between sets is given by $2\,(16^2 + 12^2 + 3^2 + 1^2 + \dots) = 52{,}368$, that
between lines by $12^2 + 8^2 + \dots = 33{,}924$.

The sum of squares between fields within farms (excluding District III)
is given by $\tfrac12(42^2 + 64^2 + 45^2 + 100^2 + \dots) = 27{,}702$. The sum of squares
between fields within farms for District III has to be calculated in the ordinary
manner, since there are three fields, and is $\tfrac12(769)^2 + \tfrac12(597)^2 + \tfrac12(687)^2
- \tfrac16(2053)^2 = 7{,}401$, giving a corresponding total sum of squares of 35,103.

Those not fully familiar with the analysis of variance technique should 
recalculate the sums of squares of this example in the various alternative ways 
indicated. 
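
For those who wish to check the arithmetic mechanically, the following sketch applies the "total²/(number of units)" rule at each level of classification. It uses the field totals of Table 8.12.a, arranged as district, farm, field, with one total per variety in District III; the grouping of fields into farms is taken from the farm totals shown in that table, and the names are illustrative.

```python
# district -> list of farms -> list of fields -> unit totals (one per variety)
districts = {
    "I": [[[232], [190]], [[306], [370]]],
    "II": [[[332], [287]], [[290]]],
    "III": [[[364, 405], [287, 310], [333, 354]]],
    "IV": [[[110], [210]], [[232], [185]], [[328]], [[229]], [[259]],
           [[389]], [[204]], [[244]], [[360]]],
    "V": [[[252], [329]]],
    "VI": [[[276], [310]], [[254], [415]], [[316], [364]], [[405]],
           [[353]], [[163]], [[297]], [[403]], [[318]], [[441]]],
}


def flatten(group):
    """All unit totals contained in a nested list (or a single total)."""
    if isinstance(group, int):
        return [group]
    return [value for item in group for value in flatten(item)]


def level_sum(groups):
    """Sum over groups of (group total)^2 / (number of units in the group)."""
    return sum(sum(flatten(g)) ** 2 / len(flatten(g)) for g in groups)


all_districts = list(districts.values())
all_farms = [farm for d in all_districts for farm in d]
all_fields = [field for farm in all_farms for field in farm]
all_units = flatten(all_districts)

correction = sum(all_units) ** 2 / len(all_units)
ss_districts = level_sum(all_districts) - correction
ss_farms = level_sum(all_farms) - level_sum(all_districts)
ss_fields = level_sum(all_fields) - level_sum(all_farms)
ss_varieties = level_sum([[u] for u in all_units]) - level_sum(all_fields)

print(round(correction), round(ss_districts), round(ss_farms),
      round(ss_fields), ss_varieties)
```

Each sum of squares is obtained by deducting the quantity for the next higher classification, exactly as described above; the sums of squares between lines and between sets would require the individual set yields in addition.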

For general purposes it is best to convert the mean squares into some
common units such as (cwt. per acre)². The conversion factor is here 0.0075861.
This is done in Table 8.12.b, which also shows the results of the similar
analyses for the other four years of the investigation.

TABLE 8.12.b ANALYSIS OF VARIANCE PER FIELD OF YIELDS OF WHEAT GRAIN
(CWT. PER ACRE)

                                  1934          1935          1936          1937          1938
                               d.f.  m.s.    d.f.  m.s.    d.f.  m.s.    d.f.  m.s.    d.f.  m.s.
Between districts                4   66.5      6  318.4      4   79.4      5   82.3      4  206.8
Within districts between
  farms                         11   38.9     12   27.1      7   62.2     19   52.7     14   65.3
Within farms between fields      -     -      15   22.8      8   31.2     11   24.2      8   12.1
Within fields:
  Sampling error                16   5.33     40   6.20     22  11.39     39   6.60     28   9.80
  Between sets                  32   2.11     80   2.18     45   2.52     78   5.09     55   4.78

Mean yield                          29.1          23.3          24.3          26.2          30.7



The mean squares for the same component of variance in the different 
years are not estimates of precisely the same quantities, owing to variation 
in the numbers of fields per farm, etc. In view of the small number of degrees 
of freedom in each year, however, we shall not lose much information by 
pooling all the years, weighting the mean squares in proportion to the degrees 
of freedom. This pooled estimate is shown in Table 8.12.c.

From the degrees of freedom we may deduce that there are 91 farms and
133 fields in all in the sample, i.e. a mean of 1.46 fields per farm. Denoting
the variance per set by $V_1$, the additional components of variance per line,
per field and per farm by $V_2$, $U_1$ and $U_2$ respectively, and ignoring the fact
that a few fields have more than two lines and that the number of fields per
farm is variable, we have the mean square equivalences shown in Table 8.12.c.
Hence $V_1 = 14.0$, $V_2 = 8.4$, $U_1 = 15.0$, $U_2 = 18.2$.

TABLE 8.12.c COMBINED ANALYSIS OF VARIANCE

                                    Degrees of    Mean
                                    freedom       square     Estimate

Between districts                       23        162.3
Within districts between farms          63         49.3      $\tfrac14V_1 + \tfrac12V_2 + U_1 + 1.46\,U_2$
Within farms between fields             42         22.7      $\tfrac14V_1 + \tfrac12V_2 + U_1$
Within fields between lines            145         7.69      $\tfrac14V_1 + \tfrac12V_2$
Within lines between sets              290         3.50      $\tfrac14V_1$



The value of $U_2$ is an underestimate, firstly because we have used the mean
number of fields per farm, instead of calculating the correct value of $n_0''$ from
formula 8.11.d, and secondly because of the fact that the sample of farms
was not random. For 1937 the value of $n_0''$ is 1.29, compared with the
value of $\bar{n}''$ of 1.44.
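
The components of variance follow from the pooled mean squares by successive differencing. The sketch below assumes the equivalences of Table 8.12.c (each mean square is the one below it plus one further component, with coefficients 1/4 for the set variance, 1/2 for the line variance and 1.46 for the farm component); the variable names are illustrative.

```python
# Pooled mean squares from Table 8.12.c, in (cwt. per acre) squared.
ms_sets = 3.50       # within lines between sets
ms_lines = 7.69      # within fields between lines
ms_fields = 22.7     # within farms between fields
ms_farms = 49.3      # within districts between farms

fields_per_farm = 133 / 91        # mean number of fields per farm, about 1.46

V1 = 4 * ms_sets                              # variance per set
V2 = 2 * (ms_lines - ms_sets)                 # additional component per line
U1 = ms_fields - ms_lines                     # additional component per field
U2 = (ms_farms - ms_fields) / fields_per_farm # additional component per farm

print(round(V1, 1), round(V2, 1), round(U1, 1), round(U2, 1))
```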

From the above estimates of the different components of variance we may 
calculate the variance to be expected with a sample of any given type and 
size. If the unweighted mean of the yields per acre of the different fields 
can be taken as the estimate of the mean yield per acre over the country, i.e. 
if the potential bias due to association of yield per acre with size of field, etc., 
can be ignored, the variance of the mean yield per acre with a fixed amount 
of sampling of individual fields and with equal numbers of fields taken from 
all selected farms will depend solely on the number of farms and the number 
of fields in the sample. If these are $n_1$ and $n_2$ respectively the variance of the
mean yield per acre with the same amount of sampling per field as that actually
adopted will be

$$V(\bar{y}) = \frac{U_1'}{n_2} + \frac{U_2}{n_1}$$

where $U_1' = \tfrac14V_1 + \tfrac12V_2 + U_1 = 22.7$. Thus with 200 fields from 100 farms,
2 fields per farm, the variance with the above values of the components of
variance is 0.296. This is equivalent to a standard error of 0.54 cwt. per acre,
or 2.0 per cent. of the mean yield.
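
The expected variance for any proposed numbers of farms and fields can be tabulated in the same way. This sketch assumes that the variance of the unweighted mean is the per-field component (22.7) divided by the number of fields plus the farm component (18.2) divided by the number of farms, which reproduces the figure quoted for 200 fields from 100 farms; the function name is illustrative.

```python
U1_PRIME = 22.7      # variance per field, including sampling within the field
U2 = 18.2            # additional component of variance per farm


def variance_of_mean(n_farms, n_fields):
    """Variance of the unweighted mean yield for n_fields fields on n_farms farms."""
    return U1_PRIME / n_fields + U2 / n_farms


variance = variance_of_mean(100, 200)
standard_error = variance ** 0.5
print(round(variance, 4), round(standard_error, 2))

# The same number of fields spread over twice as many farms (one field each).
print(round(variance_of_mean(200, 200), 4))
```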

This is an over-simplification of the practical situation. In general the 
possibility of bias cannot be ignored, and a properly weighted mean must 
therefore be taken. Any statement in general terms would be difficult, since
the weighting depends on the variation in numbers and acreages of the fields
on the individual farms and the sampling method adopted. Given the numbers
and acreages of the fields on an adequate sample of farms, however, the weighting
coefficients for any chosen method of sampling can be determined. If these
are denoted by $w$, and if $[w]$ denotes the sum of the weights for all the sampled
fields on a farm, the variance of the weighted mean will be

$$\frac{U_1'\,S(w^2) + U_2\,S([w]^2)}{\{S(w)\}^2}$$

Thus the relative precision of alternative methods of sampling can be evaluated
without difficulty.

The above procedure is approximate in another respect which is not entirely 
irrelevant to the practical situation. It has been assumed that the component 
of variance from field to field on the same farm, and that between farms, are 
independent of the size of the farm. This is not likely to be strictly true, and 
may introduce appreciable inaccuracy when a variable sampling fraction is 
used for farms of different sizes. 

It will be noted that no corrections for sampling from a finite population 
are necessary, provided the fraction of the farms in the sample is small. The 
second-stage sampling fraction of fields from farms may be large, but this 
fraction does not enter into the formula for the partition of the variance given 
by the analysis of variance, as for example is shown by formula 8.11.c.

Although the variance of the unbiased estimate depends on the acreages, 
and therefore cannot be easily formulated in general terms, certain general 
statements about the relative precision of different types of sample can be made 
from the above results. 

In the first place we may consider the effect of varying the amount of 
sampling of the selected fields. If the number of sets per line is reduced to
one, for example, the sampling variance per field will be $\tfrac12 V_1 + \tfrac12 V_2 = 11.2$,
instead of 7.7. If at the same time the number of lines is increased to four,
the sampling variance per field will be $\tfrac14 V_1 + \tfrac14 V_2 = 5.6$.

There is little to be gained by increase in the accuracy of the determination 
of the yields of individual fields, however. With one field per farm the 
effective variance per field if the yields are determined without error will be
$U_1 + U_2 = 33.2$, instead of 40.9. Consequently with the intensity of sampling
actually adopted the relative accuracy is 0.81. Doubling the number of lines
per field with two sets per line would only increase the accuracy by 10 per cent.

The question of whether to sample one or more fields per farm requires 
more consideration. For farms growing a given number of fields of wheat 
(greater than one) the variance if two fields are sampled will be $\tfrac12 U_1' + U_2 = 29.6$,
whereas if one field is sampled the variance will be 40.9. The whole
question of the methods of sampling farms and fields is bound up with the 
question of costs, and will be further considered in Section 8.17. 
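
The comparisons made in the last three paragraphs can be reproduced with the following Python sketch. As before, the four component values are assumptions back-solved from the figures in the text, and the helper names are hypothetical.

```python
# Assumed components of variance, inferred from the figures quoted in the text.
V1, V2, U1, U2 = 14.0, 8.4, 15.0, 18.2


def sampling_variance_per_field(sets_per_line, lines_per_field):
    """Within-field sampling variance of the estimated field yield."""
    return V1 / (sets_per_line * lines_per_field) + V2 / lines_per_field


def variance_per_farm(fields_per_farm, sets_per_line=2, lines_per_field=2):
    """Effective variance per farm when fields_per_farm fields are sampled."""
    within = sampling_variance_per_field(sets_per_line, lines_per_field)
    return (U1 + within) / fields_per_farm + U2


adopted = variance_per_farm(1)                    # scheme actually adopted
error_free = U1 + U2                              # field yields known exactly
relative_accuracy = error_free / adopted
doubled_lines = variance_per_farm(1, sets_per_line=2, lines_per_field=4)
gain = adopted / doubled_lines - 1                # gain from doubling the lines
two_fields = variance_per_farm(2)                 # two fields per farm

print(round(adopted, 1), round(relative_accuracy, 2),
      round(gain, 2), round(two_fields, 2))
```

The sketch makes plain why more intensive sampling within fields gains little: the within-field term is a small part of the total, so the between-fields and between-farms components dominate whatever is done inside the field.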

The effect of varying the number of unit areas per set cannot be precisely 
determined from the above analysis. If the unit areas of each set were randomly 
and independently located, the variance of the set means would be inversely
proportional to the number of units per set, and would be determined from
the variance between sets within lines. With the even spacing of units within
the set, however, we may expect the reduction in variance with increasing
numbers of unit areas to be somewhat greater than in the case of random 
location. From the basic data giving the yields of the separate unit areas it 
would be possible to determine the variance per unit area within sets, and 
from this variance and the variance between sets within lines a fair idea of the 
departure from the random law could be obtained. 

As a first approximation, however, we may assume that the random law 
holds. In this case, taking two instead of three unit areas per set would multiply 
the value of $V_1$ by 3/2, and would therefore raise the value of $U_1' + U_2$ from
40.9 to 42.6. It was in fact recognized after the first year's work that there
was little to be gained from having more than a small number of unit areas 
per set, and the number, which was five in the first year, was then reduced 
to three. 

8.13 A special case of two-stage sampling 

The possibility of sampling from within strata with probability proportional 
to size at the first stage, and with second-stage sampling fractions so chosen 
that the overall sampling fraction is uniform, has already been mentioned in 
Section 3.10 and subsequently. 

This case is of considerable practical importance, and also provides a useful 
example of the application of the above methods to the more complicated types 
of two-stage sampling. 

From the results already given in Sections 7.16 and 7.17 we have 



$$V(Y) = \sum s_i'^2 X_i^2 (1 - \bar f_i')/n_i' + \sum \bar f_i' X_i^2 V''(\bar r_i) \qquad (8.13)$$

where $V''(\bar r_i)$ is the estimated second-stage variance of $\bar r_i$.

We will consider the case in which the sampling at the second stage is 
random (or stratified with uniform sampling fraction) and the number of 
second-stage units is taken as the measure of size. If $n_i'$ first-stage units are
selected from the $i$th stratum the probability of selection of the $j$th unit will
be $n_i' N_{ij}/N_i$, where $N_{ij}$ is the number of second-stage units in the $j$th
first-stage unit, etc. The second-stage sampling fraction for this unit, if selected,
will be $f N_i/(n_i' N_{ij})$, where $f$ is the uniform overall sampling fraction. The
number of second-stage units selected will be $f N_i/n_i'$. Thus the same number
of second-stage units will be selected from each of the selected first-stage
units in a given stratum. If the variance $\sigma_i''^2$ of $y$ per unit at the second stage
can be taken as constant for the whole of the $i$th stratum the estimated variance
of $r_{ij}$ ($= \bar y_{ij}$) will be

$$V''(r_{ij}) = s_i''^2 (1 - f_{ij}'') \, n_i'/(f N_i)$$

which is constant for all the selected units of the $i$th stratum except for the
factor $(1 - f_{ij}'')$. Provided all the $f_{ij}''$ are moderately small it will be sufficient
to replace them by $\bar f_i'' = f/\bar f_i'$. We then have

$$V''(\bar r_i) = s_i''^2 (1 - \bar f_i'')/(f N_i)$$

The $X_i$ of formula 8.13 will be replaced by $N_i$, and some form of pooling
can be adopted to estimate an average value $s'^2$ of the $s_i'^2$. In cases in which the
$s_i'^2$ are likely to vary markedly, weights corresponding to those given by the
first term should be used.

We then have 

$$V(Y) = s'^2 \sum N_i^2 (1 - \bar f_i')/n_i' + \sum \bar f_i' s_i''^2 N_i (1 - \bar f_i'')/f$$

Following the previous procedure this may be re-written as 

V (Y) = W Nf 2 (1 -//')/ nf + 2 si"* Nt (1 -//")// 
where 



If the $\bar f_i'$ are approximately equal, as will usually be the case, and the $s_i''$
are the same for all strata, we have

V (Y) = W (1 -/ 



The second term of $V(Y)$ will be recognized as $(1 - \bar f'')/(1 - f)$ times
the variance which would be obtained with single-stage sampling of the second-stage
units with uniform sampling fraction and the first-stage units as strata.
If $\bar f''$ is small, therefore, the first term gives the increase in variance due to
the adoption of the two-stage process, as in the case of two-stage random 
sampling. 
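
The selection scheme of this section can be illustrated by a short Python sketch for a single stratum. The sizes of the first-stage units, the number selected and the overall sampling fraction are hypothetical figures, and the function name is not from the text. The sketch assumes that no unit is so large that its probability of selection exceeds unity.

```python
def pps_uniform_overall(sizes, n_first, f):
    """Selection probabilities and second-stage fractions for one stratum.

    sizes   : numbers of second-stage units N_ij in the first-stage units
    n_first : number of first-stage units selected from the stratum
    f       : uniform overall sampling fraction
    """
    N_i = sum(sizes)
    take = f * N_i / n_first            # second-stage units per selected unit
    rows = []
    for N_ij in sizes:
        p_select = n_first * N_ij / N_i           # first-stage probability
        f_second = f * N_i / (n_first * N_ij)     # second-stage fraction
        rows.append((N_ij, p_select, f_second, f_second * N_ij))
    return take, rows


# Hypothetical stratum of four first-stage units.
take, rows = pps_uniform_overall([100, 250, 400, 250], n_first=2, f=0.05)
for N_ij, p_select, f_second, units in rows:
    print(N_ij, round(p_select, 3), round(f_second, 4), round(units, 1),
          round(p_select * f_second, 3))
```

For every unit the product of the two fractions equals the overall fraction, and the number of second-stage units taken is the same whichever units are selected.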

8.14 Effect of change in size of the sampling units 

If the population is divided into N large units, each of which is subdivided 
into K small units, and if a sample of k small units from each of n large units 
is taken, an analysis of variance between and within large units can be made, 
and the components of variance U and V estimated as in Section 8.10. 

The estimate of the overall variance between small units will be given, 
as before, by $V + U(N - 1)/N$. That between large-unit means of $k$ small
units will be given by $1/k$ times the expectation of the mean square between
large units, i.e. by $V/k + U$. The variance between large-unit means when
all $K$ small units of each large unit are included will be $V/K + U$.

These results enable us to determine the effect on the sampling error of the 
alternatives of using the large or the small units as sampling units. It will 
be noted that for this determination it is not necessary to have data in which 
all the small units that go to make up the selected large units are observed. 
The analogy with two-stage sampling will be apparent.

Various extensions of these results are of interest. If the large units are
stratified, with $N_i$ units per stratum in the population and $n_i$ in the sample,
there will be a further between-strata component of variance $U_s$, and the mean
squares in the analysis of variance will provide estimates as follows :

Between strata . . . . . . . . . . . . . . $V + kU + k n_i U_s$
Within strata between large units . . . . $V + kU$
Within large units between small units . . $V$

The estimates of the different variances will then be :

Small units within strata . . . . . . . . $V + U(N_i - 1)/N_i$
Large units (means) within strata . . . . $V/K + U$
Large units (means) overall . . . . . . . $V/K + U + U_s(t - 1)/t$

The first two variances are the same as previously, except that $N$ is replaced
by $N_i$.
The above approach enables us to determine the effect of simultaneous 
change of size of unit and size of strata. This is relevant when the strata can 
be of any size, and the size is therefore chosen to contain two units (or one unit) 
per stratum. If the size of unit is halved and the amount of material in the 
sample remains the same, for example, there will be twice as many units, and
the size of the strata can therefore be halved. We shall then require a four-fold 
analysis of variance into whole strata, half-strata, whole units, and half-units. 
The minimum amount of data required for this purpose will be two whole 
units (of which each half-unit is separately recorded) in each half-stratum. 
The expressions for mean squares and variances will be similar to those
given above, $N_i$ being the number of whole units per half-stratum in the
population, and $t$ the number (two) of half-strata per stratum. These
expressions are as follows :

Mean squares :

Within whole strata between half-strata . . $V + 2U + 4U_s$
Within half-strata between whole units . . $V + 2U$
Within whole units between half-units . . . $V$

Variances :

Half-units within half-strata . . . . . . . $V + U(N_i - 1)/N_i$
Whole units within half-strata . . . . . . $\tfrac12 V + U$
Whole units within whole strata . . . . . . $\tfrac12 V + U + \tfrac12 U_s$

Since there will be half as many whole units as half-units the relative
precision of the two methods of sampling will be given by

$$\frac{V + 2U + U_s}{V + U(N_i - 1)/N_i}$$
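
As a check on the last expression, the following Python sketch computes the three variances from the components and forms the ratio directly. The component values and the function names are hypothetical, chosen only to illustrate the calculation.

```python
def variances_from_components(V, U, U_s, N_i, K=2, t=2):
    """Variances per unit derived from the components of variance.

    K is the number of small units per large unit and t the number of
    smaller strata per larger stratum (both two in the halving case).
    """
    small_within = V + U * (N_i - 1) / N_i
    large_within = V / K + U
    large_overall = V / K + U + U_s * (t - 1) / t
    return small_within, large_within, large_overall


def relative_precision(V, U, U_s, N_i):
    """Variance with whole units in whole strata relative to that with
    half-units in half-strata, for the same amount of material."""
    small_within, _, large_overall = variances_from_components(V, U, U_s, N_i)
    # Half as many whole units as half-units, hence the factor 2.
    return 2 * large_overall / small_within


# Hypothetical components: V = 10, U = 4, U_s = 2, 20 whole units per half-stratum.
print(round(relative_precision(10.0, 4.0, 2.0, 20), 3))
```

A ratio greater than unity indicates that the smaller units in the smaller strata are the more precise for a given amount of material.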

8.15 Variation in size of strata 

When the strata boundaries are arbitrary, the size of the strata may be 
varied in such a manner that a fixed number of units require to be selected
from each stratum, whatever the size of the sample. The strata will naturally
be taken as small as possible, i.e. so as to contain two units if a rigorous estimate 
of error is required, or one unit otherwise. 

In order to determine the size of sample required for a given accuracy 
under these conditions it is necessary to know the relation between the size 
of the strata and the within-strata variance. The simplest way in which this 
relation can be determined is to obtain data for all units of a representative 
sample of the largest strata that are of interest. Strata of any smaller size 
can then be constructed, and the within-strata variances calculated. 

A minor difficulty in this construction is that the original strata will only 
be exactly subdivisible into strata of smaller size if these contain numbers of 
units which are integral fractions of the numbers in the original strata. If in 
area sampling the strata are also to be of the same shape, only squares of 
integral fractions, i.e. 1/4, 1/9, . . . , will give exact subdivision. For strata of
intermediate size there will therefore be a certain amount of arbitrariness in the 
location of the strata boundaries. Some objective rule must therefore be 
followed. If it appears desirable, overlapping strata may be used. Thus in a 
case in which the data cover a set of isolated squares, four sets of smaller squares 
may be taken within each large square, each set having a corner point coincident 
with one corner of the large square. 

If the smallest strata likely to be of interest each contain a large number of 
units, the collection of data in full for all the units of these basic strata is likely 
to be laborious. Instead a random sample of such units may be taken. In 
this case the within-strata variances can be estimated by means of an analysis 
of variance similar to that used for change in size of sampling units (Section 8.14).
If the small units of that section are taken as equivalent to the units of the 
present case, the large units as equivalent to the basic strata, and the strata 
as equivalent to the larger strata, the same expressions hold. 

This procedure has the disadvantage that variances can only be obtained 
for strata which contain an integral number of the basic strata; if the strata
are all to be of the same shape the number must be a square. This disadvantage
can be overcome by sub-stratifying the basic strata, with random selection
of units from within these sub-strata. Thus in area sampling with square
strata, if each basic stratum is subdivided into nine square sub-strata, with
a minimum of two selected units per sub-stratum, square strata can be constructed
with areas of 1, 1 7/9, 2 7/9, 4, 5 4/9, 7 1/9, 9, . . . times the area of the basic
stratum. Separate analyses of variance will be required for the different sizes 
of strata, but these have certain elements in common. 

When the variances have been calculated for certain sizes of strata an 
approximate variance-size relationship can be constructed by graphical means. 
It is often advantageous to plot the log-variance against the log-size. A straight
line on this graph represents a variance law of the type $\sigma_z^2 = a z^b$, where $z$ is
the number of units per stratum and $a$ and $b$ are constants.

For the purpose of determining the size of sample required for a given
accuracy it is better to plot $z \sigma_z^2$ against $z$. If $N$ is the total number of units
in the population the number of units in a sample with two units per stratum
will be $2N/z$, and we shall have $z \sigma_z^2 = 2N \, V(\bar y)$. Thus the required size
of the strata can be read off from the graph. With one unit per stratum the
factor 2 is omitted.

[Figure: axis labelled ACCURACY (ARBITRARY UNITS)]

FIG. 8.15 RELATIONS BETWEEN COST AND ACCURACY IN SAMPLING FOR MEAN SOIL
TEMPERATURE OVER A PERIOD, WITH STRATA OF VARYING SIZE

The mean temperature is estimated from temperatures taken (a) on two days
selected at random from each stratum (block of days), (b) on one day selected at
random from each stratum, (c) on days equally spaced throughout the period.
Reproduced by permission of the Royal Society (Yates, 1948, A).
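
The graphical procedure for finding the size of strata can also be carried out numerically. The Python sketch below fits the variance law by least squares on the log scale and then solves for the stratum size that gives a required variance of the mean. The variances, the population size and the target variance are hypothetical, the function names are illustrative, and the finite population correction is ignored, as in the relation it solves.

```python
import math


def fit_power_law(sizes, variances):
    """Least-squares line through (log z, log variance): variance = a * z**b."""
    xs = [math.log(z) for z in sizes]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b


def stratum_size(a, b, N, target_var, units_per_stratum=2):
    """Solve z * a * z**b = units_per_stratum * N * target_var for z."""
    return (units_per_stratum * N * target_var / a) ** (1.0 / (1.0 + b))


# Hypothetical within-strata variances for strata of 4, 9, 16 and 25 units.
sizes = [4, 9, 16, 25]
variances = [4.0, 6.0, 8.0, 10.0]
a, b = fit_power_law(sizes, variances)
z = stratum_size(a, b, N=1000, target_var=0.027)
sample_size = 2 * 1000 / z
print(round(a, 3), round(b, 3), round(z, 2), round(sample_size))
```

In practice the fitted law would be used only within the range of stratum sizes for which variances have actually been calculated.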

It has been pointed out in Section 3.14 that systematic sampling, when 
used on the type of material for which it is suitable, is likely to have an error 
variance which is somewhat less than random sampling with one unit per 
stratum. In neither type of sampling can the sampling error be estimated 
with any certainty from the results of a single sample. In random sampling
with one unit per stratum, however, an objective estimate of error is possible
if additional randomly located units are taken in certain of the strata, but 
in a systematic sample much more elaborate methods have to be used (Yates, 
1948, A), and even then the estimates obtained are not fully objective. 

It may be noted here that the common practice of estimating the error 
of a sample with one unit per stratum by combining the strata in pairs, will 
give an estimate of error which will generally be somewhat greater than the 
true error with strata of double the size and two units per stratum. 

An example of the relation between the accuracy of sampling with two 
units per stratum, sampling with one unit per stratum, and systematic sampling, 
is given in Fig. 8.15. The curves (full lines) are based on the variances 
found for daily soil temperatures at 1 foot depth, each daily reading constituting 
a sampling unit. The cost scale is proportional to the number of units, and 
the accuracy scale gives the accuracy of the sample estimate of the mean soil 
temperature over a period. The curves themselves are based on relations for 
$\sigma_z^2$ of the type given above. The curve of losses due to errors and the broken
curves will be referred to in Section 8.18. 

The material is of the type in which the reduction in variance with reduction 
in size of strata may be expected to be considerable. This is brought out by 
the curves. The relative precisions of the three types of sampling are given 
by the intercepts of horizontal lines, which are in the ratio 1 : 1.75 : 4.24.
The relative efficiencies are given by the reciprocals of the intercepts of vertical
lines, which are in the ratio 1 : 1.36 : 2.22. This provides an illustration of
the marked difference between relative precision and relative efficiency when 
reduction in the size of the strata results in a considerable reduction in variance 
per unit. 

8.16 Efficiency in terms of cost 

In the previous sections we have described how to determine the size of 
sample necessary to attain results of a given accuracy when various methods 
of sampling are used. We have also indicated how the relative efficiency (in 
terms of numbers of sampling units) of different sampling methods and 
variations in a given method may be judged. Minimization of the number of 
sampling units or amount of material included in the sample will not in general, 
however, give maximum efficiency in terms of cost. To attain this the sampling 
method must be so chosen that the total cost of the survey is minimized. 

To minimize the total cost it is necessary to know the relative costs of the 
different operations. Exact evaluation of these costs is usually troublesome, 
and is only worth while if an extensive survey has to be undertaken, or 
if a series of surveys on similar material is contemplated. The matter is 
complicated by the fact that for many purposes it is the marginal cost of an 
additional unit, rather than the average cost per unit, that is required. Nevertheless
it is not difficult in the course of survey operations, or even in the
course of a pilot survey, to obtain data which will serve to give rough estimates
of the main components of the costs. With the aid of such estimates the 
efficiency of further surveys of similar material can often be substantially 
improved. 

When information on the costs of different types of operation is available 
it is possible to determine the values of the sampling fractions, etc., which 
for a given sampling method will give results of the required accuracy for the 
least cost. Such values may be termed the optimal values. It is also possible 
to determine which of two methods, each employed in the most efficient 
manner, will be the least costly. 

The determination of optimal values of the sampling fractions, etc., requires 
minimization of the cost function, and will be dealt with in the next section. 
The choice between different methods when the optimal values of the sampling 
fractions, etc., are known, or when there are no variants of this type, can be 
obtained directly from the results of the previous sections. 

Thus in the case in which there is the possibility of using supplementary 
information, if $c_s$ represents the cost per unit of obtaining the supplementary
information, $c_0$ the marginal cost per unit when no supplementary information
is obtained (these costs being taken to include the marginal costs of abstraction
and computation), and $C_1$ represents the additional computational cost of
utilizing the supplementary information (which apart from the above marginal
cost per unit may be taken as broadly independent of the size of the sample),
the total cost of a sample of $n_s$ units with supplementary information, excluding
elements of cost which are fixed for both methods, will be

$$C_s = C_1 + n_s (c_0 + c_s)$$

and that for a sample of $n_0$ units without supplementary information will be

$$C_0 = n_0 c_0$$

Under conditions in which the error variance is inversely proportional 
to the number of units in the sample, the two samples will be of equal accuracy 
when the numbers of units are in inverse ratio to the relative precision of the 
two methods with equal numbers. If the regression method of adjustment is 
used, therefore, and the sample is random, 

$$n_s/n_0 = 1 - \rho^2$$

where $\rho$ is the true correlation coefficient between the main and the
supplementary variates (estimate $r$).

Hence the use of supplementary information will be more efficient if

$$n_0 c_0 > C_1 + n_0 (1 - \rho^2)(c_0 + c_s)$$

i.e. if

$$n_0 (c_0 + c_s) \rho^2 > n_0 c_s + C_1$$

If the cost of adjustment $C_1$ can be ignored this inequality becomes

$$c_s/c_0 < \rho^2/(1 - \rho^2)$$

which is independent of $n_0$, and therefore of the accuracy required. Thus,
for example, under these conditions, if $\rho = \tfrac12$ the use of supplementary
information will be worth while if the cost of collection is less than one-third
the cost of taking an equal number of additional units.

With $\rho = \tfrac12$, however, the gain will not be marked unless the ratio of the
costs is considerably less than $\tfrac13$. With a ratio of $\tfrac16$ the total costs will be in the
ratio of 7 : 8 (minimum value, with zero value of the cost ratio, 3 : 4). With
higher values of $\rho$ the gains are more marked. With $\rho = \tfrac34$ the two methods
have equal cost when $c_s/c_0 = 9/7$. In this case, when the ratio has the values
$\tfrac12$ and $\tfrac14$ the ratio of the total costs will be 21 : 32 and 35 : 64 respectively
(minimum value 7 : 16).
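
The cost ratios quoted in the last two paragraphs follow directly from the inequality above when the fixed cost of adjustment is ignored. A minimal Python sketch, using exact fractions, is given below; the function names are illustrative and not from the text.

```python
from fractions import Fraction as F


def break_even_cost_ratio(rho):
    """Value of c_s/c_0 at which the two methods have equal cost."""
    return rho ** 2 / (1 - rho ** 2)


def total_cost_ratio(rho, cost_ratio):
    """C_s/C_0 at equal accuracy, ignoring the fixed adjustment cost C_1.

    cost_ratio is c_s/c_0; the sample with supplementary information
    needs only (1 - rho**2) times as many units.
    """
    return (1 - rho ** 2) * (1 + cost_ratio)


for rho, ratios in [(F(1, 2), [F(1, 6), F(0)]),
                    (F(3, 4), [F(1, 2), F(1, 4), F(0)])]:
    print("rho =", rho, "break-even c_s/c_0 =", break_even_cost_ratio(rho))
    for cr in ratios:
        print("   c_s/c_0 =", cr, "-> C_s/C_0 =", total_cost_ratio(rho, cr))
```

Because the break-even ratio depends only on the correlation, the decision whether to collect supplementary information can, under these conditions, be taken before the size of the sample is settled.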

8.17 Minimization of the cost function 

When the sampling fractions, etc., of a method of sampling are not fully 
determined by the accuracy required in the results, the optimal values can be 
determined by minimizing the cost function. 

The total cost can usually be expressed as a linear function of the numbers 
of sampling units $n_1, n_2, \ldots$ in the various strata, etc., at least to a first
approximation, using marginal costs. The simplest procedure is then to add
a multiple $K$ of this linear function to the expression in terms of $n_1, n_2, \ldots$
for the variance of the required estimate, and differentiate the resultant
expression with respect to $n_1, n_2, \ldots$ in turn. This minimizes the variance for
fixed cost, which is equivalent to minimizing the cost for fixed variance. The 
exact procedure will be apparent from the first of the cases treated below. 

(a) Variable sampling fraction 

If the marginal cost of taking an additional unit of the $i$th stratum is $c_i$,
the total cost $C$, omitting constant elements, is given by

$$C = \sum c_i n_i$$

Hence

$$V(Y) = \sum \sigma_i^2 (1 - f_i) N_i^2/n_i \qquad (8.17.a)$$

$$= \sum \sigma_i^2 (1/n_i - 1/N_i) N_i^2 + K (\sum c_i n_i - C)$$

Differentiating with respect to the $n_i$ and equating to zero, we have the
$t$ equations ($i = 1, 2, \ldots, t$)

$$-\sigma_i^2 N_i^2/n_i^2 + K c_i = 0$$

Hence, since $n_i/N_i = f_i$,

$$f_i = \sigma_i/\sqrt{K c_i} \qquad (8.17.b)$$

Thus the optimal sampling fractions are proportional to $\sigma_i/\sqrt{c_i}$. This is an
extension of the formula already given in Section 3.5.

The actual values of the sampling fractions required to attain a given
accuracy can be obtained by substituting for the $f_i$ in equation 8.17.a and
solving for $K$.* This gives

$$\sqrt{K} \sum N_i \sigma_i \sqrt{c_i} = V(Y) + \sum N_i \sigma_i^2 \qquad (8.17.c)$$

* An example of the calculations will be found in Example 10.4.

If any of the $f_i$ are equal to unity the corresponding terms must be omitted
from both sides of the equation. This may require a trial solution.
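
The calculation described under (a) can be sketched in Python as follows. The strata sizes, standard deviations, costs and target variance are hypothetical, and the function names are illustrative. The sketch does not carry out the trial solution needed when a computed sampling fraction reaches unity; it assumes all fractions are less than one.

```python
import math


def optimal_fractions(N, sigma, cost, target_var):
    """Optimal sampling fractions f_i = sigma_i / sqrt(K c_i) for a target V(Y).

    N, sigma and cost are per-stratum lists. sqrt(K) is obtained from the
    relation sqrt(K) * sum(N s sqrt(c)) = V(Y) + sum(N s^2).
    """
    root_K = ((target_var + sum(n * s * s for n, s in zip(N, sigma)))
              / sum(n * s * math.sqrt(c) for n, s, c in zip(N, sigma, cost)))
    return [s / (root_K * math.sqrt(c)) for s, c in zip(sigma, cost)]


def variance_of_total(N, sigma, f):
    """V(Y) = sum of sigma_i^2 (1 - f_i) N_i^2 / n_i, with n_i = f_i N_i."""
    return sum(s * s * (1 - fi) * n / fi for n, s, fi in zip(N, sigma, f))


# Hypothetical two-stratum population.
N = [1000, 500]
sigma = [10.0, 20.0]
cost = [1.0, 9.0]
f = optimal_fractions(N, sigma, cost, target_var=1.0e6)
print([round(fi, 4) for fi in f], round(variance_of_total(N, sigma, f)))
```

Substituting the fractions back into the variance formula, as the last line does, is a convenient check that the required accuracy is in fact attained.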

(b) Two-phase sampling 

If $c_1$ represents the cost per unit of obtaining the first-phase information,
$c_2$ the additional cost per unit of obtaining the second-phase information, and
there are $n_1$ first-phase units, of which $n_2$ are included in the second phase,
the total cost, apart from constant elements, will be given by

$$C = n_1 c_1 + n_2 c_2$$

When the methods of sampling and estimation are such that the effective
variances of the estimates at each phase (apart from corrections for finite
sampling) are inversely proportional to the numbers of units, from the results
of Section 8.7 we have

$$V(\bar y) = \sigma_2^2/n_2 + (\sigma_1^2 - \sigma_2^2)/n_1 \qquad (8.17.d)$$

Following the above procedure, we find

$$n_2/n_1 = \sqrt{\frac{K^2}{1 - K^2} \cdot \frac{c_1}{c_2}}$$

where $K = \sigma_2/\sigma_1$. The values of $n_1$ and $n_2$ required for a given accuracy can
be obtained by substituting for $n_2$ in terms of $n_1$ in $V(\bar y)$.

(c) Two-phase point sampling 

We will only consider the special case arising in crop estimation (Examples 
6.16.b and 7.15). 

If the acreages are determined from $n_0'$ points and the yields per acre are
determined on fields of the crop in question in which $n$ of these points fall,
and if $c'$ is the cost of visiting the field to determine the nature of the crop,
and $c$ the additional cost of a yield determination, we have, when a proportion
$p$ of the area is under the crop and the mean yield per acre is $\bar r$,

$$C = c' n_0' + c n$$

$$V(Y)/Y^2 = q/(p n_0') + V(r)/(n \bar r^2) \qquad (8.17.e)$$

Hence

$$n/n_0' = \sqrt{\frac{p \, V(r) \, c'}{q \, \bar r^2 \, c}} \qquad (8.17.f)$$

The values of $n_0'$ and $n$ are best obtained by substitution for $n$ in terms of
$n_0'$ in the equation for $V(Y)$.

(d) Two-stage sampling 

If c' is the cost per first-stage unit, and c" the additional cost per second-stage 
unit, the total cost is given by 

$$C = n' c' + n c''$$

With a random or stratified random sample with uniform sampling fraction
and equal numbers of second-stage units per first-stage unit, $V(\bar y)$ is given by
formula 8.11.c. Thus

$$V(\bar y) = U/n' + V/n + \text{const.},$$

where

$$U = \sigma'^2 - \sigma''^2/N'', \quad \text{and} \quad V = \sigma''^2.$$

Following the previous procedure, we find

$$n^2/n'^2 = n''^2 = V c'/(U c'') \qquad (8.17.g)$$

In other words the number of second-stage units per first-stage unit is
independent of the accuracy required. The values of $n'$ and $n$ required for
any given accuracy can be obtained by substitution in the equation for $V(\bar y)$.
The same formulae hold for any form of two-stage sampling in which $V(\bar y)$
can be written in the above form. Thus stratification with a variable sampling 
fraction at the second stage is covered, provided none of the sampling fractions 
are unity. 
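
The two-stage optimum of case (d) reduces to a few lines of arithmetic, sketched below in Python. The function name and the first set of figures are hypothetical. The second call uses only the coefficients and cost ratio that appear in Example 8.17.c, to recover the number of second-stage units per first-stage unit found there.

```python
import math


def optimal_two_stage(U, V, c_first, c_second, target_var, const=0.0):
    """Optimal n'' and the resulting n' and n for V(y) = U/n' + V/n + const.

    n'' = sqrt(V c' / (U c'')) does not depend on the accuracy required;
    n' then follows from U/n' + V/(n' n'') = target_var - const.
    """
    n_per_unit = math.sqrt(V * c_first / (U * c_second))
    n_first = (U + V / n_per_unit) / (target_var - const)
    return n_per_unit, n_first, n_first * n_per_unit


# Hypothetical components and costs.
n2, n1, n = optimal_two_stage(U=4.0, V=36.0, c_first=9.0, c_second=1.0,
                              target_var=0.5)
print(n2, n1, n)

# Coefficients and cost ratio as quoted in Example 8.17.c (n'' only;
# the target variance here is an arbitrary placeholder).
n2_example, _, _ = optimal_two_stage(U=80.71, V=2524.2, c_first=10.0,
                                     c_second=1.0, target_var=1.0)
print(round(n2_example, 1))
```

Non-integral values would in practice be rounded, and the variance recomputed for the rounded numbers.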

(e) Two-stage sampling with probability proportional to size at the first stage 

The solution of the case of sampling from within strata with probability 
proportional to size of unit at the first stage follows similar lines. We find 
that if the cost per second-stage unit is the same for all first-stage units in all 
strata, one condition for minimum cost is that the second-stage sampling 
fractions are so chosen that the overall sampling fraction is uniform. Thus 
the use of a uniform overall sampling fraction, which is computationally con- 
venient, is justified on grounds of minimum cost. The assumption of constant 
cost per second-stage unit will not in fact hold for the component of cost due 
to travel, since the same number of second-stage units will be taken from any 
selected unit of a stratum, and consequently the travel cost per unit will be 
greater for the larger units. This, however, is not likely to reduce the efficiency 
greatly unless travel costs at the second stage are very large. 

The relation between the first-stage and second-stage sampling can also be 
very simply expressed. In the case considered in Section 8.13, in which the 
size of the first-stage units is represented by the number of second-stage units, 
if all the second-stage variances are equal we may put $\sigma_i'^2 - \sigma''^2 N_i'/N_i = U_i$
and $\sigma''^2 = V$, the costs per first and second-stage unit being taken as $c_i'$ and $c''$.
We then have 

V N + V(V/Q Nf V(Uf ef) 
J V(Y)+ 



Since the number of second-stage units $n_i''$ per first-stage unit in stratum $i$
is the same for all selected units, and independent of which particular units
are selected, the above equations give $n_i''^2 = V c_i'/(U_i c'')$, as before.

(f) Two-stage sampling of farm and fields

Any fully general treatment is difficult owing to the fact that the numbers 
of fields per farm carrying a given crop are usually small, and consequently 
what would otherwise be the optimal values of the second-stage sampling 
fractions will give numbers of fields per farm which are not only non-integral, 
but which will in many cases be less than unity. 

If the numbers of fields per farm are sufficiently large for this source of 
disturbance to be neglected, the optimal values of the sampling fractions can 
be simply expressed. 

In order to standardize the notation we may replace the between-farms 
component of variance $U_2$, as defined in Section 8.12, by $U'$, and the between-fields
within-farms component $U_1'$ by $U''$. The cost of visiting a farm may be
taken as $c'$, and that of sampling a field as $c''$.

We will consider the case in which the farms are divided into size-groups
with fields which have mean areas $a_1, a_2, \ldots$. We will further assume that
the mean acreages per field within a size-group of farms of 1, 2, 3, ... fields
are the same, and that $V(a_i)/a_i^2$ is constant for all size-groups and for all
numbers of fields within a size-group.

In the first place we find that in this case the second-stage sampling fractions 
within a size-group should all have the same value, which is given by 



where $N_{i1}', N_{i2}', N_{i3}', \ldots$ are the numbers of farms in the group with 1, 2,
3, ... fields respectively, and $[W]_i = N_{i1}' + 4 N_{i2}' + 9 N_{i3}' + \ldots$ The
ratios of the first-stage sampling fractions are given by



= & say (S.lV.i) 



These equations will serve to give first approximations to the relative 
sampling fractions. The relative efficiency of different variants which are 
practically applicable can then be tested by use of the expression for the 
variance of the weighted mean given in Section 8.12. It will usually be sufficient 
to use the mean acreage for each size-group in evaluating the weights, but in
evaluating the actual size of sample required the factors $a_i^2$ should be replaced
by $a_i^2 + V(a_i)$.

Example 8.17.a

Determine, from the data of Examples 6.12.b and 7.12.b, the optimal 
proportion of sample plots to eye estimates on conifer stands in a two-phase 
sampling scheme in which eye estimates only are made at the first phase, when 
the cost of visiting a stand and making an eye estimate is 1/10 the additional
cost of measuring a sample plot.

Consideration of this problem would be relevant if it were possible to
demarcate and classify the stands into conifers of over 20 years of age, etc.,
from aerial photographs. In this case, if the sampling of stands is with
probability proportional to size, the variances of the volumes per acre, given
in Example 7.12.b, will be required. We then have, with the regression
method of estimation,

$$s_1^2 = 4803 \qquad s_2^2 = 3579$$

$$K^2 = 0.745 \qquad K^2/(1 - K^2) = 2.92 \qquad c_1/c_2 = 1/10$$

$$n_2/n_1 = \sqrt{2.92/10} = 0.540$$

Thus sample plots should be taken on about one-half the stands which are 
visited. 

There is, however, in this case no appreciable gain by the use of two-phase
sampling. If $n_2'$ is the number of sample plots required if no eye estimates
are made we have, for equal variance,

$$s_1^2/n_2' = s_2^2/n_2 + (s_1^2 - s_2^2)/n_1$$

which gives

$$n_2/n_2' = n_2/n_1 + (1 - n_2/n_1) K^2 = 0.540 + 0.460 \times 0.745 = 0.883$$

$$n_1/n_2' = 1.64$$

Thus with two-phase sampling

$$C/(c_1 n_2') = 1.64 + 10 \times 0.883 = 10.47$$

and with single-phase sampling $C/(c_1 n_2')$ lies between 10 and 11, depending on
the saving due to the omission of the eye estimates on the stands that are
visited.

One of the reasons why the use of eye estimates is here of little value is 
that the determination of the volumes of individual stands by means of a single 
sample plot per stand is very inaccurate. If more sample plots per stand were 
taken the overall variance of y would be reduced, while the covariance of y 
and x and the variance of x would remain unaltered. Under these circumstances 
two-phase sampling would be more advantageous. Given information on the 
within-stand variance of the sample plots and the cost of taking different 
numbers of sample plots from a stand, the optimal number of sample plots 
per stand could be determined. 
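
The figures of this example can be retraced with the following Python sketch, which works from the two variances and the cost ratio quoted above. The variable names are illustrative, and the variance form used is the one assumed for two-phase sampling with regression (residual variance over the second-phase number, remainder over the first-phase number). The computation carries full precision throughout, so the last digit may differ slightly from the rounded working of the text.

```python
import math

s1_sq, s2_sq = 4803.0, 3579.0   # variances quoted in the example
cost_ratio = 1.0 / 10.0         # c1/c2: eye estimate relative to sample plot

k_sq = s2_sq / s1_sq
# Optimal ratio of sample plots to stands visited.
ratio = math.sqrt(k_sq / (1.0 - k_sq) * cost_ratio)          # n2/n1

# Sample sizes relative to the single-phase number n2' for equal variance.
n2_over_single = ratio + (1.0 - ratio) * k_sq                # n2/n2'
n1_over_single = n2_over_single / ratio                      # n1/n2'

# Cost in units of c1 * n2'.
two_phase_cost = n1_over_single + n2_over_single / cost_ratio

print(round(k_sq, 3), round(ratio, 3), round(n2_over_single, 3),
      round(n1_over_single, 2), round(two_phase_cost, 2))
```

The two-phase cost falls within the range of 10 to 11 units found for single-phase sampling, which is the basis of the conclusion that eye estimates bring no appreciable gain here.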

Example 8.17.b

If in the crop survey of Examples 6.16.b and 7.15 the additional cost of 
crop-cutting in order to obtain an estimate of yield is 20 times the cost of 
visiting a sampling point to ascertain the nature of the crop, calculate the optimal 
ratio of the number of yield determinations to total sample points, and the
number of points required to give an estimate of the total yield with a standard
error of 5 per cent.

From the results already given we have

$$p = 2{,}202/33{,}255 = 0.0662 \qquad q = 0.9338$$

$$\bar r = 15.7 \qquad V(r) = 3.5^2 \qquad V(r)/\bar r^2 = 0.0497$$

Hence, from equation 8.17.f,

$$n/n_0' = \sqrt{\frac{0.0662 \times 0.0497}{0.9338 \times 20}} = 0.0133$$

Substituting in equation 8.17.e,

$$(0.05)^2 \, n_0' = 0.9338/0.0662 + 0.0497/0.0133$$

$$n_0' = 7140 \qquad n = 95 \qquad n' = p \, n_0' = 473$$

The large number of points that have to be visited to ascertain the crop is 
accounted for by the small fraction of the total land area under crop. If the 
crop is an important one it will occupy a considerably larger fraction of the 
cultivated area, and if, therefore, the non-cultivated areas can be excluded, 
the total number of points required will be considerably reduced. Alternatively 
sparsely cultivated areas may be sampled with a lower intensity. 

It is also worth noting that if several crops have to be surveyed it will 
probably be possible to make the acreage determinations of all crops 
simultaneously prior to the crop-cutting work. This will alter the above cost 
relationships. The general case can be dealt with by minimization of the 
combined cost function. In the simple case in which there are a number of 
crops each occupying the same area and having the same variance and cost 
relationships, and in which the same accuracy is required for each crop, the 
above solution holds, the cost of the acreage determinations being spread 
equally over all the crops, and adjustment of c being made for the cost of 
revisits. Thus in the above example, with 5 crops and a cost of revisit per 
point of double the original cost (owing to wider dispersion), all that is necessary 
is to put c/c' = 110. We then find n' = 9160, n = 52, pn' = 606. 
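
The arithmetic of this example can be retraced with a short script. This is only a sketch: it assumes that equation 8.17.f has the form n/n' = sqrt(p v / (q c)), with v the relative variance of the yield determinations and c the cost ratio, and that equation 8.17.e has the form (relative standard error)^2 n' = q/p + v/(n/n'), both inferred from the figures quoted above. The function and variable names are hypothetical.

```python
import math

def two_phase_crop_design(p, rel_var_yield, cost_ratio, rel_se):
    """Return (n/n', n', n, points expected to fall on the crop).

    Assumed forms (inferred from the worked figures, not quoted from the text):
      n/n'        = sqrt(p * v / (q * c))
      rel_se^2 n' = q/p + v / (n/n')
    """
    q = 1.0 - p
    ratio = math.sqrt(p * rel_var_yield / (q * cost_ratio))
    n_total = (q / p + rel_var_yield / ratio) / rel_se ** 2
    return ratio, n_total, ratio * n_total, p * n_total

p = 2202 / 33255                    # fraction of points falling on the crop
rel_var = 3.5 ** 2 / 15.7 ** 2      # V(r) divided by the squared mean yield

single_crop = two_phase_crop_design(p, rel_var, cost_ratio=20, rel_se=0.05)
five_crops = two_phase_crop_design(p, rel_var, cost_ratio=110, rel_se=0.05)

for label, result in (("one crop", single_crop), ("five crops", five_crops)):
    ratio, n_total, n_cut, on_crop = result
    print(f"{label}: n/n' = {ratio:.4f}, n' = {n_total:.0f}, "
          f"n = {n_cut:.0f}, on crop = {on_crop:.0f}")
```

The unrounded values differ slightly from those in the text, which carries rounded intermediate figures.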

Example 8.17.c 

If county lists of farms are not available, and if the cost of the construction 
of a list of the farms of a parish is 10 times the cost of visiting a single farm 
within the parish and ascertaining the wheat acreage, determine the optimal 
sampling fractions at the first and second stage which will give estimates of 
the acreage of wheat having a standard error of 4500 (i.e. approximately 
10 per cent.), using the methods of sampling followed in samples B1 and B2 
of Hertfordshire farms. 
From the equation for V (Y) in terms of n' and n given in Example 8.11.a 
and equation 8.17.g, we have 

$$n'' = \sqrt{2524.2 \times 10/80.71} = 17.7$$

Hence, for the required accuracy, 

$$4.5^2 = 80.71/n' + 2524.2/(17.7\, n') - 1.8982$$

$$n' = 10.1$$

$$n = 178.8$$
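
A brief sketch of the same calculation follows. It assumes that the variance expression of Example 8.11.a has the form V(Y) = A/n' + B/(n' n'') - D, in units of 10^6, with A = 80.71, B = 2524.2 and D = 1.8982, and that equation 8.17.g gives n'' = sqrt(B c/A), where c is the ratio of the listing cost to the cost of a farm visit. These forms are inferred from the figures above; the names are hypothetical.

```python
import math

def two_stage_design(a_between, b_within, constant_term, list_cost_ratio,
                     target_variance):
    """Return (farms per parish n'', parishes n', total farms n).

    Assumes V = A/n' + B/(n' n'') - D and n'' = sqrt(B * c / A).
    """
    n_second = math.sqrt(b_within * list_cost_ratio / a_between)
    n_first = (a_between + b_within / n_second) / (target_variance + constant_term)
    return n_second, n_first, n_first * n_second

# Standard error of 4500 acres, i.e. 4.5 in units of 1000 acres.
n_second, n_first, n_farms = two_stage_design(
    a_between=80.71, b_within=2524.2, constant_term=1.8982,
    list_cost_ratio=10, target_variance=4.5 ** 2)

print(f"n'' = {n_second:.1f}, n' = {n_first:.1f}, n = {n_farms:.1f}")
```

The total comes out a little below 178.8 because the text multiplies the rounded values 10.1 and 17.7.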

Similarly, from Example 8.11.b, 



For the reasons already given these latter values are approximate. When 
proper allowance is made for the changes in sampling fractions at the second 
stage a somewhat smaller sample will be found to be necessary. 

Example 8.17.d 

From the data of Table 6.19.a determine suitable sampling fractions for 
a crop-estimation scheme for sugar beet, on the assumption that variances 
between fields on the same farm, and between farms, are the same as those 
found for wheat in Section 8.12, and that the cost of visiting a farm is (a) equal 
to, and (b) twice that of sampling a field. 

From Table 6.19.a, including farms not growing sugar beet on old arable 
land, we obtain the following values : 

nf |V] 2 ai 

Small farms ...... 30 51 3-1 

Medium farms ...... 47 224 8-4 

Large farms ...... 14 121 12-3 

Taking the estimates of N//[W] 2 given by /'/[*''] formulas 8.17.h and 
8.17.i give the following values for//' and//': 

                            f_i''          f_i''         f_i'/k 
                          c1 = c2       c1 = 2 c2 

Small farms   ......        0.86           1.21            4.0 

Medium farms  ......        0.51           0.72           18.3 

Large farms   ......        0.38           0.54           36.2 

Thus when $c_1 = c_2$ we may consider variants on the following scheme. 
For small farms sample all fields ; for medium farms sample one field from 
farms growing 1 or 2 fields, two fields from farms growing 3 or 4 fields, etc. ; 
for large farms sample one field from farms growing 1, 2, 3 or 4 fields, two fields 
from farms growing 5, 6 or 7 fields, etc. ; sample farms in the proportions 
1:4:8. 

A similar scheme can be drawn up for the case when $c_1 = 2c_2$. In this 
case, since $f_i''$ for small farms is greater than unity, some increase in the 
proportion of small farms included may be advisable. 

Further investigation is left to the reader. 

8.18 Losses due to errors 

We have so far considered the minimization of the cost of a survey when 
a given accuracy is required in the results. Since, however, errors in the 
results themselves give rise to losses when these results are used as a basis 
for further action, the accuracy should itself be determined in such a manner 
that the sum of the cost of the survey and the expected losses due to the 
resultant errors is minimized. 

If the loss due to an error Z in an estimate Y is equal to $aZ^2$, where a is 
a constant, the average loss in a series of samples of the same size and type 
in which the estimates are free from bias will be $a\,V(Y)$, whatever the actual 
form of the distribution function of the errors. 

When the loss due to an error is proportional to the square of the error, 
therefore, minimization of the sum of the cost C of a survey and the average 
loss due to errors requires minimization of the function 

$$C + a\,V(Y)$$

the sampling method and size of sample being so chosen that V (Y) has its 
minimum possible value for the cost C. 

Under these circumstances V (Y) will always be expressible as a function 
of C. In many cases, as we have seen, this function is of the form 

$$V(Y) = \frac{h}{C - C_0} - k$$

where h and k are constants depending on the population which is being 
surveyed, and $C_0$ is the overhead or constant component of cost which is 
independent of the size of the survey. 

In this case the minimum value will be attained when 

$$C - C_0 = \sqrt{ah}$$

We then have 

$$V(Y) = \sqrt{h/a} - k$$

This implies that the more accurate the results that can be obtained with a 
given cost, i.e. the smaller the value of h, the higher should be the accuracy 
aimed at with a given loss function. The value of the cost-plus-loss function 
at the minimum is in fact $2C - C_0 - ak$. Any saving due to increased 
accuracy should therefore be divided equally between reduction in the cost 
of the survey and reduction in the loss due to errors. Equally if the loss due 
to a given error is multiplied by a factor $\lambda$, the funds devoted to the survey 
(excluding overhead) should be multiplied by $\sqrt{\lambda}$. 
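
The behaviour of the optimum under a squared-error loss can be illustrated numerically. The sketch below assumes a variance-cost relation of the form V(Y) = h/(C - C0) - k and a loss of a V(Y); the numerical values of a, h, k and C0 are purely hypothetical and are not taken from any survey in this book.

```python
import math

def variance_for_cost(cost, h, k, overhead):
    """Assumed variance-cost relation V(Y) = h / (C - C0) - k."""
    return h / (cost - overhead) - k

def cost_plus_loss(cost, a, h, k, overhead):
    """Survey cost plus average loss a * V(Y)."""
    return cost + a * variance_for_cost(cost, h, k, overhead)

def optimal_cost(a, h, overhead):
    """Cost at which cost-plus-loss is least: C - C0 = sqrt(a h)."""
    return overhead + math.sqrt(a * h)

# Hypothetical constants, chosen only to give round numbers.
a, h, k, overhead = 4.0, 900.0, 0.5, 20.0

best = optimal_cost(a, h, overhead)
minimum = cost_plus_loss(best, a, h, k, overhead)
print(best, minimum, 2 * best - overhead - a * k)

# Multiplying the loss by 9 trebles the funds excluding overhead.
print((optimal_cost(9 * a, h, overhead) - overhead) / (best - overhead))
```

With these figures the optimal cost is 80, of which 60 is variable cost, and the minimum of the cost-plus-loss function is 138, in agreement with 2C - C0 - ak.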

The same general conclusions hold when the variance-cost relation is of 
a more complicated form than that given above. A case of this type is 
illustrated in Fig. 8.15, in which a loss curve of the form a V (Y) has been 
inserted. We see that with the more accurate methods of sampling the minima 
of the cost-plus-loss functions (shown by broken lines) are attained when both 
the cost of the sampling and the loss due to errors are less than with the less 
accurate methods. 

Other loss functions will lead to more complicated expressions for the 
average loss. The most general loss function which is capable of relatively 
simple expression in terms of V (Y) is that in which the loss due to a positive 
error is equal to $aZ^b$, and that due to a negative error is $a'(-Z)^b$, a, a' and b 
being constants. Provided the distribution of errors has the same form for all 
values of V (Y), the average loss is then equal to $a''\sigma^b$, where $\sigma^2 = V(Y)$ 
and a'' has a value which is a linear function of a and a'. The actual linear 
function can only be determined if the distribution function of the errors is 
known. In general terms, if the distribution function of the errors of Y is 
$f(z)\,dz$, where $z = Z/\sigma$, we have 

$$a'' = a\int_0^{+\infty} z^b f(z)\,dz + a'\int_{-\infty}^{0} (-z)^b f(z)\,dz$$

If the distribution of the errors is normal, $f(z)\,dz$ will be of the form given 
in Section 7.3, with $\sigma = 1$. The two integrals will in this case (as in any 
symmetrical distribution) be equal. Their values for any value of b can be 
obtained from existing tables*, those for b = 1.0, 1.25, 1.5, 1.75, 2.0 being 
0.3989, 0.4097, 0.4300, 0.4599, 0.5 respectively. It must be emphasized, 
however, that the distribution of sampling errors is frequently not sufficiently 
normal for the use of these values to be justified. In such cases, also, the 
form of the distribution may be expected to change with change in the size of 
the sample. 
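
For normally distributed errors the values of the integral quoted above can be reproduced from the Gamma function. The sketch below relies on the standard result that the integral of z^b over the positive half of the unit normal distribution equals 2^(b/2 - 1) Γ((b + 1)/2)/√π; the function names are hypothetical.

```python
import math

def normal_half_moment(b):
    """Integral of z**b times the unit normal density over z from 0 to infinity."""
    return 2 ** (b / 2 - 1) * math.gamma((b + 1) / 2) / math.sqrt(math.pi)

def combined_loss_constant(a_positive, a_negative, b):
    """Constant a'' for normal errors, where the two integrals are equal."""
    return (a_positive + a_negative) * normal_half_moment(b)

for b in (1.0, 1.25, 1.5, 1.75, 2.0):
    print(f"b = {b:<5} integral = {normal_half_moment(b):.4f}")
```

When a = a' and b = 2 the function `combined_loss_constant` returns a, so that the average loss reduces to a V(Y) as in the squared-error case.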

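With this more general loss function we require to minimize the function 

$$C + a''\left(\frac{h}{C - C_0} - k\right)^{b/2}$$

which will be minimum when 

$$(C - C_0)^2 = \tfrac{1}{2}\, a'' b\, h \left(\frac{h}{C - C_0} - k\right)^{b/2 - 1}$$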

This equation can easily be solved by trial and error, or directly if k can be 
neglected, as will be the case when all sampling fractions are small. The 
same general conclusion that the accuracy should be increased with a more 
accurate method of survey still holds. 

When V (Y) is a more complicated function of C (possibly only determined 
in numerical form) the minimum of the cost-plus-loss function can itself be 
determined by trial and error. 
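
A trial-and-error solution of this kind is easily mechanized. The sketch below assumes the same variance-cost relation V(Y) = h/(C - C0) - k and an average loss of a''σ^b, and locates the minimum by repeatedly narrowing a bracket (a ternary search, which presumes the function has a single minimum inside the bracket). The constants are hypothetical. When k is neglected the result can be compared with the direct solution (C - C0)^(1 + b/2) = (b/2) a'' h^(b/2).

```python
def cost_plus_loss_general(cost, a2, b, h, k, overhead):
    """Cost plus average loss a'' * sigma**b, with sigma**2 = h/(C - C0) - k."""
    return cost + a2 * (h / (cost - overhead) - k) ** (b / 2)

def minimise_by_trial(func, low, high, iterations=200):
    """Narrow the bracket [low, high] around the minimum of a unimodal function."""
    for _ in range(iterations):
        third = (high - low) / 3
        left, right = low + third, high - third
        if func(left) < func(right):
            high = right
        else:
            low = left
    return (low + high) / 2

def direct_solution(a2, b, h, overhead):
    """Closed form when k is neglected."""
    return overhead + ((b / 2) * a2 * h ** (b / 2)) ** (1 / (1 + b / 2))

# Hypothetical constants with k neglected.
a2, b, h, k, overhead = 2.0, 1.5, 900.0, 0.0, 20.0
by_trial = minimise_by_trial(
    lambda c: cost_plus_loss_general(c, a2, b, h, k, overhead),
    low=overhead + 1e-3, high=1000.0)
print(by_trial, direct_solution(a2, b, h, overhead))
```

If k is not negligible the upper end of the bracket must be kept below C0 + h/k, beyond which the assumed variance would be negative.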

* Tables of the Gamma function. 

8.19 Concluding remarks 

The preceding sections give an indication of the ways in which the efficiencies 
of different sampling methods can be compared, and the techniques of 
determining the optimal sampling fractions, size of sample required for a given 
accuracy, etc. It has been further shown that the accuracy which should 
be aimed at is itself related to the losses resulting from errors in the survey 
results. 

Since determination of the optimal accuracy from the expected losses due 
to errors demands knowledge of the loss-function, it will chiefly be of relevance 
when action in the economic sphere has to be based on the results of the 
survey. An error in the estimate of the yield of a crop, for instance, may 
require changes in an import programme, or may lead to wastage, and the 
resultant additional costs may be assessable, at least roughly. The losses 
due to errors in estimates provided by surveys of the research and investigational 
type can scarcely be assessed. Indeed, it is usually impossible to give any 
quantitative estimate in monetary terms of the value of the information provided 
by such surveys. The decision to undertake the survey, and the accuracy 
aimed at, must then be a matter of judgment on the part of those who require 
the information, and those who are concerned with the allocation of resources. 

Even if the optimal accuracy cannot be quantitatively determined, arbitrary 
decisions on the accuracy required should as far as possible be avoided. Before 
any decision as to accuracy is taken, estimates should be prepared of the costs 
of obtaining results of differing degrees of accuracy, and these estimates should 
be considered in relation to the purposes for which the results are required. 

Minimization of costs can of course be carried out whether or not a loss- 
function is available. In this chapter we have only considered this minimization 
when a single quantity requires estimation. In most censuses and surveys 
such treatment would be an over-simplification. A number of quantities will 
require to be estimated, frequently for many domains of study. It may then 
be necessary to carry out a more elaborate investigation, minimizing the cost 
for defined accuracies of all the estimated quantities. Alternatively, if loss- 
functions are available for all of these quantities, the combined cost-plus-loss 
function can be minimized. Frequently, however, one of the quantities is of 
dominant importance, and the situation is such that when adequate accuracy 
is attained on this quantity the remaining quantities are determined with more 
than the required accuracy. In this case minimization can be conducted solely 
with reference to this quantity. 

Many of the examples worked out in this chapter are based on very small 
amounts of data, and the conclusions reached on the relative efficiencies of 
different methods, even in the particular circumstances of the chosen examples, 
must therefore be treated with reserve. These examples are, in fact, merely 
intended to illustrate the computational procedures, and bring out the various 
points that have to be taken into account when making calculations of relative 
efficiencies, optimal sampling fractions and size of sample. They are in no 
way intended as general investigations into the relative efficiency of the different 
methods. 

On the other hand, it should be borne in mind that no very exact 
determinations of the optimal sampling fractions and size of sample are required 
in the practical planning of surveys. If the values adopted are somewhere near 
the optimal the total cost, or the total of costs-plus-losses, will be very near 
the minimum. 

We must also not be deterred from undertaking a survey by the fact that 
there is little information on which to base exact planning of the sampling 
methods. As we have seen, surveys themselves provide information which 
will enable future surveys on similar material to be more efficiently planned. 
In surveys on relatively unknown material one of the points to be kept in mind 
in the planning is that information will be required both on variances and on 
costs. Equally, if preliminary rough estimates are required, pilot investigations 
can be designed so as to provide such estimates, as well as information on 
which to base the planning of a larger survey. 

The study of the relative efficiency of different sampling methods depends 
not so much on having a large amount of data as on having data which are 
relevant to the methods concerned. Thus the small pilot investigation on the 
estimation of wheat yields by sampling methods described in Section 8.12 
was of sufficient size to give estimates of both the field-to-field and farm-to-farm 
components of variance with all necessary accuracy. On the other hand, in 
the Survey of Fertilizer Practice although a very large amount of data has 
now been accumulated it is impossible to determine the field-to-field 
components of variation of the fertilizer dressings, since only one old- and one 
new-arable field of each crop was taken on each farm. This must be regarded 
as a defect in the planning of this survey, which could have been remedied 
had a pair of fields been taken for the various crops on a small proportion of 
the farms. Incidentally, lack of this information has also prevented any 
consideration of the question of the extent to which individual farmers vary 
their fertilizer practice from field to field of the same crop. 

Given the necessary data, the increase in the efficiency of survey methods 
requires proper statistical investigations of the types outlined in this chapter. 
The need for thorough investigation of the efficiencies of different sampling 
methods in different circumstances is great, and it is to be hoped that many 
more will be made and reported in the future. Such investigations are often 
neglected because, once a survey has been completed, the question of whether 
it could have been carried out more efficiently is largely historical as far as 
that survey is concerned. One of the reasons why both the theory and practice 
of sample censuses and surveys have made rapid advances in recent years 
is that permanent organizations, often part of, or attached to, research statistical 
institutes, have been set up in a number of countries. These organizations 
have been actively engaged both in the planning and the execution of surveys 
covering various fields of enquiry. Consequently they have not only had 
access to the necessary data, or the means of collecting it, but they have also 
had a continuing interest in investigations of efficiency, and a body of workers 
who have both the training and experience to carry out these investigations. 
Further progress may be expected on the same lines. In particular the 
problems that arise in censuses and surveys of undeveloped areas will be likely 
to receive very much more thorough investigation when more centres which 
are actively concerned with the planning and execution of surveys in these 
areas are developed. Only in this way will a body of experience be built up 
which is relevant to the special problems of such surveys. 



CHAPTER 9 

FURTHER NOTES ON THE CRITICAL ANALYSIS OF 

SURVEY DATA 

9.1 Introduction 

As has been pointed out at various places in the preceding chapters, surveys 
fall into two main classes : those which have as their object the assessment of 
the characteristics of the population or different parts of it, and those which 
are investigational in character. In the census type of survey, estimates of 
the characteristics, quantitative and qualitative, of the whole population and 
possibly of various previously defined subdivisions of it are required. These 
estimates form the basis of administrative action, either directly or after 
incorporation with information from other sources. The accuracy to be aimed 
at is determined by the nature of the administrative action that is envisaged. 
In the investigational type of survey we are more concerned with the study of 
relationships between different variates, and with contrasts between different 
domains. In such surveys estimates appertaining to the whole population are 
usually of relatively minor interest. 

The critical analysis of the results of an investigational survey is a much 
more difficult task than is the calculation of estimates and their errors in a 
survey of the census type. The matter has already been briefly discussed in 
Sections 5.23 and 5.24, and in various of the illustrative examples. In the 
present chapter the matter has been taken somewhat further by the inclusion 
of some additional examples, and by a discussion of the uses of ratios and 
regressions in investigational work. The discussion of sampling errors of 
contrasts between domains (an important, if somewhat tiresome, subject) 
has also been amplified and extended. It must be emphasised, however, that 
the chapter is not intended to be an exhaustive treatise on the analysis of 
investigational surveys : this would require much more space than is available 
here. 

9.2 Contrasts between domains : random sample 

The formulae for estimates of the domain values are exactly analogous to 
those for the population values given in Section 6.4. If the suffix a denotes 
domain A, so that $n_a$, for example, is the number of units in the sample falling 
in that domain, and $\bar y_a$ and $S_a(y)$ are the mean and total of all the y values of 
the domain A units in the sample, we have 

$$p_a = \frac{n_a}{n} \tag{9.2.a}$$

$$N_a = g\,n_a = N p_a \tag{9.2.b}$$

$$\bar y_a = \frac{S_a(y)}{n_a} \tag{9.2.c}$$

$$Y_a = g\,S_a(y) = N_a \bar y_a \tag{9.2.d}$$

$$r_a = \frac{S_a(y)}{S_a(x)} \tag{9.2.e}$$

The first two formulae have already been given in Section 6.4 with a slightly 
different notation (n a = , p = p, N a = U). The additional symbol a 
for the proportion of units in the domain A is introduced for convenience. 
As before a small, but usually trivial, gain in accuracy can be obtained by 
replacing N by N. m 

In addition to the proportion $p_a$ of all units belonging to domain A we are 
often concerned with the proportion of the units of domain A that possess a 
given attribute, or with their total number. This proportion, which may be 
denoted by $h_a$, must be clearly distinguished from $p_a$. Following the notation 
of Section 6.4, the corresponding total number in the population may be 
denoted by $U_a$, and that in the sample by $u_a$. Then 

$$h_a = \frac{u_a}{n_a} \tag{9.2.f}$$

$$U_a = g\,u_a = h_a N_a \tag{9.2.g}$$

The estimation of the variances of $\bar y_a$, $Y_a$ and $r_a$ differs in two respects 
from the estimation of the variances of $\bar y$, Y, and r. The first cause of difference 
is that only the component of variance of y within the domain A, and not the 
total variance of y in the population, contributes to the variance of the estimates 
$\bar y_a$ and $Y_a$. This component may be denoted by $s_a^2$. We have 

$$s_a^2 = \frac{S_a (y - \bar y_a)^2}{n_a - 1} \tag{9.2.h}$$

The variances within the different domains will not only differ from the 
total variance, being in general less than this total variance, but may also differ 
amongst themselves. If, however, the population is divided into a large number 
of different domains, and the nature of the material is such that the variances 
within the different domains may be expected to be approximately equal, a 
pooled estimate of this variance over all domains may be adopted, using the 
analysis of variance technique in the manner of Section 7.7. 

In a similar manner the variance of $r_a$ depends on the variance $s_{ra}^2$ within 
the domain A about the ratio line for that domain. We have 

$$s_{ra}^2 = \frac{S_a (y - r_a x)^2}{n_a - 1} \tag{9.2.i}$$

The second cause of difference lies in the fact that the variance of the 
total $Y_a$ is increased because $N_a$ is subject to variation. Moreover the numbers 
of selected sample units $n_a$ and $n_b$ in two domains are negatively correlated, 
the covariance of $n_a$ and $n_b$ being $-n^2 p_a p_b/(n - 1)$, and this gives rise to 
negative covariances between $p_a$ and $p_b$, $N_a$ and $N_b$, and $Y_a$ and $Y_b$. 

Furthermore, for the reason given in Section 7.2, the factor $(1 - f)$, 
which makes allowance for sampling from a finite population, should normally 
be omitted from the variance formulae when comparisons between different 
domains are being made in investigational studies. 

Putting $q_a = 1 - p_a$, the resultant formulae for the variances and 
covariances of the estimates are as follows : 

$$V(p_a) = \frac{p_a q_a (1 - f)}{n - 1} \tag{9.2.j}$$

$$\operatorname{cov}(p_a, p_b) = -\frac{p_a p_b (1 - f)}{n - 1} \tag{9.2.k}$$

$$V(N_a) = \frac{g^2 n^2 p_a q_a (1 - f)}{n - 1} \tag{9.2.l}$$

$$\operatorname{cov}(N_a, N_b) = -\frac{g^2 n^2 p_a p_b (1 - f)}{n - 1} \tag{9.2.m}$$

$$V(\bar y_a) = \frac{1 - f}{n_a}\, s_a^2 \tag{9.2.n}$$

$$\operatorname{cov}(\bar y_a, \bar y_b) = 0 \tag{9.2.o}$$

$$V(Y_a) = \frac{g^2 n (1 - f)}{n - 1} \{ n_a q_a \bar y_a^2 + (n_a - 1) s_a^2 \} \tag{9.2.p}$$

$$\operatorname{cov}(Y_a, Y_b) = -\frac{g^2 n^2 (1 - f)}{n - 1}\, p_a p_b\, \bar y_a \bar y_b \tag{9.2.q}$$

$$V(r_a) = \frac{1 - f}{n_a \bar x_a^2}\, s_{ra}^2 \tag{9.2.r}$$

$$\operatorname{cov}(r_a, r_b) = 0 \tag{9.2.s}$$

The variances of $h_a$ and $U_a$, and the corresponding covariances, can be 
obtained from the variances and covariances of $\bar y_a$ and $Y_a$ by scoring all units 
with the attribute 1 and all those without the attribute 0. This gives 

$$\bar y_a = h_a, \qquad Y_a = U_a, \qquad s_a^2 = n_a h_a (1 - h_a)/(n_a - 1).$$

The quantity in the curly bracket of formula 9.2.p will be found to equal 

$$n_a h_a (1 - p_a h_a).$$

If each of the domains of study with which we are concerned forms only a 
small fraction of the whole population, the covariances will be small relative 
to the corresponding variances, but if the fractions are large the covariances 
will be of some importance and must be taken into account when calculating 
the differences between estimates for the different domains. 

Example 9.2 

The final poll of the British Institute of Public Opinion in the 1951 election, 
based on a sample of 2,300 individuals, gave a forecast of the voting (excluding 
3.5 per cent. who gave no indication of the way they would vote) as follows 
(Durant and Gregory, 1951) : 

Conservative . . . . 49.5 per cent. 
Labour . . . . . . . 47.0 
Liberal  . . . . . .  3.0 
Others . . . . . . .  0.5 

If the sample were a random one, what would be the standard error of the 
difference between Conservatives (A) and Labour (B) ? 

The effective number in the sample is 96.5 per cent. of 2,300, i.e. 2,220. 
Hence 

$$V(p_a) = \frac{0.495 \times 0.505}{2219} = 0.000113$$

$$V(p_b) = \frac{0.470 \times 0.530}{2219} = 0.000112$$

$$\operatorname{cov}(p_a, p_b) = -\frac{0.495 \times 0.470}{2219} = -0.000105$$

$$V(p_a - p_b) = 0.000113 + 0.000112 - 2(-0.000105) = 0.000435 = 0.0209^2.$$

Thus the predicted Conservative percentage majority of 2.5 per cent. 
would have an estimated standard error of 2.09 per cent. If the covariance 
term had been omitted the estimate of the standard error would have been 
1.50 per cent., which is substantially below the correct value. 
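
The effect of the covariance term in this example can be checked with a few lines of code. The sketch applies formulae 9.2.j and 9.2.k with the finite-population factor omitted; the function name is hypothetical.

```python
import math

def difference_standard_errors(p_a, p_b, n):
    """Standard error of p_a - p_b in a random sample, with and without
    the covariance term of formula 9.2.k (finite-population factor omitted)."""
    var_a = p_a * (1 - p_a) / (n - 1)
    var_b = p_b * (1 - p_b) / (n - 1)
    cov_ab = -p_a * p_b / (n - 1)
    with_cov = math.sqrt(var_a + var_b - 2 * cov_ab)
    without_cov = math.sqrt(var_a + var_b)
    return with_cov, without_cov

with_cov, without_cov = difference_standard_errors(0.495, 0.470, 2220)
print(f"with covariance: {100 * with_cov:.2f} per cent")
print(f"covariance omitted: {100 * without_cov:.2f} per cent")
```

The first figure comes out marginally below 2.09 because the text works from variances rounded to six decimal places.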

It should be noted that the samples in polls of this kind are actually quota 
samples, and the random component of error will thereby be reduced by the 
stratification thus introduced. The amount of the reduction can be estimated 
if the numbers and voting intention of the different strata (quota categories) 
are known, using the formulae of the next section. The reduction will only 
be substantial, however, if the differences in voting intention of the different 
strata are very substantial (see Example 8.2.a). The non-random components 
of error in forecasts of this kind have been discussed in Section 4.22. 

9.3 Contrasts between domains : stratified sample with uniform or 
variable sampling fraction 

Three cases arise. The domains of study may consist of strata or groups 
of strata, they may consist of parts of a single stratum, or they may cut across 
strata. The first and second cases present no new problems. In the first 
case, since the part of the sample belonging to any domain constitutes a random 
or stratified random sample of that domain, the methods developed in the 
previous chapters apply. Moreover, since the sampling of the different strata is 
independent there will be no covariance between the estimates for the different 
domains. In the second case, the relevant part of the sample constitutes a 
random sample of the stratum concerned, and the formulas of the last section 
are therefore appropriate. 

In the third case, in which the domains cut across strata, the variances will 
be affected in much the same manner as in a random sample, except that in 
this case $V(\bar y_a)$ also loses its simple form. The formulae for $V(Y_a)$ and $V(\bar y_a)$ 
have already been given in Section 7.6 but are repeated here for convenience, 
using the notation of the present chapter. 

$$V(N_a) = \sum g_i^2 n_i^2 (1 - f_i)\, p_{ia} q_{ia}/(n_i - 1) \tag{9.3.a}$$

$$\operatorname{cov}(N_a, N_b) = -\sum g_i^2 n_i^2 (1 - f_i)\, p_{ia} p_{ib}/(n_i - 1) \tag{9.3.b}$$

$$N_a^2\, V(\bar y_a) = \sum g_i^2 n_i (1 - f_i) \{ n_{ia} q_{ia} (\bar y_{ia} - \bar y_a)^2 + (n_{ia} - 1) s_{ia}^2 \}/(n_i - 1) \tag{9.3.c}$$

$$N_a N_b \operatorname{cov}(\bar y_a, \bar y_b) = -\sum g_i^2 n_i^2 (1 - f_i)\, p_{ia} p_{ib} (\bar y_{ia} - \bar y_a)(\bar y_{ib} - \bar y_b)/(n_i - 1) \tag{9.3.d}$$

$$V(Y_a) = \sum g_i^2 n_i (1 - f_i) \{ n_{ia} q_{ia} \bar y_{ia}^2 + (n_{ia} - 1) s_{ia}^2 \}/(n_i - 1)$$

$$\qquad = \sum g_i (g_i - 1) \{ n_i S_{ia}(y^2) - (S_{ia} y)^2 \}/(n_i - 1) \tag{9.3.e}$$

$$\operatorname{cov}(Y_a, Y_b) = -\sum g_i^2 n_i^2 (1 - f_i)\, p_{ia} p_{ib}\, \bar y_{ia} \bar y_{ib}/(n_i - 1) \tag{9.3.f}$$

$$X_a^2\, V(r_a) = \sum g_i^2 n_i (1 - f_i) \{ n_{ia} q_{ia} (\bar y_{ia} - r_a \bar x_{ia})^2 + (n_{ia} - 1) s_{ria}^2 \}/(n_i - 1) \tag{9.3.g}$$

$$X_a X_b \operatorname{cov}(r_a, r_b) = -\sum g_i^2 n_i^2 (1 - f_i)\, p_{ia} p_{ib} (\bar y_{ia} - r_a \bar x_{ia})(\bar y_{ib} - r_b \bar x_{ib})/(n_i - 1) \tag{9.3.h}$$

For purposes of computation it is often convenient to replace $n_i p_{ia}$ by 
$n_{ia}$, etc., and when the factors $(1 - f_i)$ are included to replace $g_i^2 (1 - f_i)$ by 
$g_i (g_i - 1)$. 

As before, the covariances are of relatively little importance if each domain 
covers only a small part of each stratum. In this case the $q_{ia}$ will be nearly 
unity. As mentioned in Section 7.6 and illustrated in Example 7.7.b, an 
approximate estimate of the sampling error of the domain means (and ratio 
estimates) will then be obtained by treating the sample as if it were stratified 
for the domains but not for the strata, provided the sampling fraction is uniform. 
A similar estimate for the errors of the domain totals will be obtained by 
treating the sample in the same manner, but omitting the corrections for the 
means, i.e. by replacing $S_a (y - \bar y_a)^2$ by $S_a (y^2)$. 

This simplification does not hold when the sampling fraction is variable. 
In this case the full formulae must be used.* 

Example 9.3 

In the National Farm Survey (Section 5.21) for the county of Hereford the 
farms of the sample (excluding size-group 1) were classified into the following 
domains : 

                          Percentage of arable land 

A   Mainly grass                  0-29.9 

B   Intermediate                 30-49.9 

C   Mainly arable                50-100 

* Multi-stage sampling is considered by Durbin (1958, A"). 

Estimate the variances and covariances of the numbers of farms and the total 
and mean acreages of crops and grass in the various domains. 

The basic data are given in Table 9.3.a (there were no farms in size-group 
5). As an example we may give an outline of the computation of the variances 
and covariances of the mean acreages. The first step is to prepare tables of 
$p_{ia}$, $\bar y_{ia}$ and $s_{ia}^2$. These are given in Table 9.3.b. Tables of $q_{ia}$ and $(\bar y_{ia} - \bar y_a)$ 
(not reproduced here) will also be required. For each size-group the term of 
any particular variance or covariance can then be computed. Finally the relevant 
terms can be added and divided by $N_a^2$, etc., to give the variances and covariances. 



TABLE 9.3.a NUMBERS, TOTAL ACREAGES AND SUMS OF SQUARES 

                                      Domain 
Size-group               A             B             C           Total 

Number of farms, n_ia, etc. 
2                       72            62            39             173 
3                       79           155           108             342 
4                       13            61            25              99 
Raised total         1,062         1,362           872           3,296 

Total acreage, S_ia(y), etc. 
2                    3,708         3,553         2,054           9,315 
3                   11,570        27,410        18,800          57,780 
4                    4,860        22,460         9,410          36,730 
Raised total        93,080       190,090       114,560         397,730 

Sum of squares, S_ia(y^2), etc. 
2                  225,980       243,987       124,252         594,219 
3                1,838,900     5,409,900     3,582,000      10,830,800 
4                1,850,400     8,537,800     3,649,900      14,038,100 



TABLE 9.3.b VALUES OF p_ia, ȳ_ia AND s_ia^2 

Size-group     g_i        p_ia        p_ib        p_ic 
2               10       .4162       .3584       .2254 
3                4       .2310       .4532       .3158 
4                2       .1313       .6162       .2525 

Size-group               ȳ_ia        ȳ_ib        ȳ_ic        ȳ_i 
2                       51.50       57.31       52.67       53.84 
3                      146.46      176.84      174.07      168.95 
4                      373.85      368.20      376.40      371.01 
ȳ_a, etc.               87.65      139.57      131.38      120.67 

Size-group              s_ia^2      s_ib^2      s_ic^2      s_i^2 
2                       493.2       661.9       423.0       538.7 
3                      1851.4      3654.2      2891.7      3135.0 
4                      2792.3      4468.4      4499.0      4192.8 

The results are shown in Table 9.3.c. It will be seen that the variances of 
$Y_a$ and $\bar y_a$ are both substantially greater for the separate domains than would 
be expected from the corresponding variances for the whole county. 
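
As an illustration, the variances and covariances of the numbers of farms and of the total acreages can be recomputed from the entries of Table 9.3.a. The sketch below uses formulae 9.3.a, 9.3.b, 9.3.e and 9.3.f in their computational forms, with $g_i^2 (1 - f_i)$ replaced by $g_i (g_i - 1)$ and $n_i p_{ia}$ by $n_{ia}$. The data layout and function names are hypothetical.

```python
# Raising factors, farm counts, acreage totals and sums of squares by size-group.
STRATA = [
    {"g": 10, "n": {"A": 72, "B": 62, "C": 39},
     "sum_y": {"A": 3708, "B": 3553, "C": 2054},
     "sum_y2": {"A": 225980, "B": 243987, "C": 124252}},
    {"g": 4, "n": {"A": 79, "B": 155, "C": 108},
     "sum_y": {"A": 11570, "B": 27410, "C": 18800},
     "sum_y2": {"A": 1838900, "B": 5409900, "C": 3582000}},
    {"g": 2, "n": {"A": 13, "B": 61, "C": 25},
     "sum_y": {"A": 4860, "B": 22460, "C": 9410},
     "sum_y2": {"A": 1850400, "B": 8537800, "C": 3649900}},
]

def _weight(stratum):
    """g(g - 1)/(n_i - 1) for one size-group."""
    n_i = sum(stratum["n"].values())
    return stratum["g"] * (stratum["g"] - 1) / (n_i - 1), n_i

def var_number(dom):
    """V(N_a), formula 9.3.a."""
    total = 0.0
    for s in STRATA:
        w, n_i = _weight(s)
        total += w * s["n"][dom] * (n_i - s["n"][dom])
    return total

def cov_number(dom1, dom2):
    """cov(N_a, N_b), formula 9.3.b."""
    return -sum(_weight(s)[0] * s["n"][dom1] * s["n"][dom2] for s in STRATA)

def var_total(dom):
    """V(Y_a), second form of formula 9.3.e."""
    total = 0.0
    for s in STRATA:
        w, n_i = _weight(s)
        total += w * (n_i * s["sum_y2"][dom] - s["sum_y"][dom] ** 2)
    return total

def cov_total(dom1, dom2):
    """cov(Y_a, Y_b), formula 9.3.f."""
    return -sum(_weight(s)[0] * s["sum_y"][dom1] * s["sum_y"][dom2] for s in STRATA)

print(round(var_number("A")), round(cov_number("A", "B")))
print(round(var_total("A") / 1e4), round(cov_total("A", "B") / 1e4))
```

The variances and covariances of the totals are printed after division by 10^4, which corresponds to the row headed Y_a/100 in Table 9.3.c.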

TABLE 9.3.c VARIANCES AND COVARIANCES 

                            Variances                              Covariances 
                    A        B        C    Whole county     A, B      A, C      B, C 

N_a, etc.         4559     4668     3662                   -2783     -1776     -1885 

Y_a/100, etc.     3394     6111     4528       2208        -2028     -1257     -2627 

ȳ_a, etc.        12.72    21.14    34.49       2.03        -6.19     -5.83     -9.15 

The variances and covariances for the separate domains can be used to 
calculate the variance of the corresponding estimate for the whole county, 
which thus provides a check. For number and total acreage the agreement 
should be exact. For mean acreage the agreement is only approximate, since 
$\bar y_a$ is correlated with $N_a$. We find, in fact, $(1062^2 \times 12.72 + \ldots - 2 \times 
1062 \times 1362 \times 6.19 - \ldots)/3296^2 = 2.70$, compared with the correct value 
of 2.03. 
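
The check described above amounts to evaluating the variance of a weighted sum of the domain estimates. A minimal sketch, using the entries of Table 9.3.c with the covariances taken as negative, is given below; the function name is hypothetical.

```python
def variance_of_weighted_sum(weights, variances, covariances):
    """Variance of sum(w_a * estimate_a), given the variances and the
    covariances for the pairs (0, 1), (0, 2) and (1, 2)."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    total = sum(w * w * v for w, v in zip(weights, variances))
    for (i, j), c in zip(pairs, covariances):
        total += 2 * weights[i] * weights[j] * c
    return total

ones = [1, 1, 1]
check_number = variance_of_weighted_sum(ones, [4559, 4668, 3662], [-2783, -1776, -1885])
check_total = variance_of_weighted_sum(ones, [3394, 6111, 4528], [-2028, -1257, -2627])

# For the mean acreage the weights are the raised numbers of farms over the total.
shares = [1062 / 3296, 1362 / 3296, 872 / 3296]
check_mean = variance_of_weighted_sum(shares, [12.72, 21.14, 34.49], [-6.19, -5.83, -9.15])

print(check_number, check_total, round(check_mean, 2))
```

For numbers the result is 1, i.e. zero apart from rounding, since the total number of farms is not subject to sampling error. For total acreage it is 2209 (in units of 10^4), and for mean acreage 2.70.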

9.4 Errors in the estimation of the proportions of the population total 
attributable to different domains 

In many cases in which a quantitative variate is being studied we are 
interested in the proportions or percentages which are attributable to different 
domains, rather than the actual totals for the domains. Thus in the National 
Farm Survey, described in Section 3.7, interest attached to the proportion 
of the total farm land that was tenant-occupied, the proportion that was farmed 
by full-time farmers, etc. 

Estimates of such proportions are given very simply for all types of sampling 
by dividing the estimate $Y_a$ of the total for the domain A by the estimate Y 
of the total for the whole population. Thus, denoting the estimate of the 
proportion by $P_a$, we have 

$$P_a = \frac{Y_a}{Y}$$

If the population total is known from other sources we may use $P_a$ to provide 
an alternative estimate $Y_a'$ of the domain total which will in general be more 
accurate than $Y_a$. The formula is 

$$Y_a' = P_a Y$$

where Y is here the known population total. 

Estimation, therefore, presents no new problems, but the estimates of error 
will be affected by the covariance between $Y_a$ and Y. 

Case (a) Domains not cutting across strata 

When a domain A comprises one or more complete strata there will be no 
covariance between $Y_a$ and $Y_{\bar a}$, where $Y_{\bar a}$ is the estimate of the total of the 
remainder of the population. We also have $Y_a + Y_{\bar a} = Y$. Consequently 
$V(Y) = V(Y_a) + V(Y_{\bar a})$, and $\operatorname{cov}(Y_a, Y) = V(Y_a)$. The ordinary formula 
for the variance of a ratio then gives 

$$Y^2\, V(P_a) = (1 - 2P_a)\, V(Y_a) + P_a^2\, V(Y) = Q_a^2\, V(Y_a) + P_a^2\, V(Y_{\bar a})$$

where $Q_a = 1 - P_a$. If there are more than two domains the first form is 
most suitable for computation. 

The covariance between the proportions for two mutually exclusive domains 
A and B can be similarly deduced from the formula for the covariance of two 
ratios, which in the notation of Section 1.5 is 



2> yji 
x 4 J 



We thus find 

$$Y^2 \operatorname{cov}(P_a, P_b) = -\{ P_b\, V(Y_a) + P_a\, V(Y_b) - P_a P_b\, V(Y) \}$$

It is easily verified that when A and B together make up the whole population 
$\operatorname{cov}(P_a, P_b) = -V(P_a) = -V(P_b)$ as it should. More generally, if the 
population is divided into a number of domains the sum of all the variances plus 
twice the sum of all the covariances should equal zero. This provides a useful 
check when all variances and covariances are required. 

Case (b) Random sample and stratified sample with domains of study cutting 
across strata 

In this case there will be covariance between $Y_a$, $Y_b$, $Y_c$, . . . It is best
to calculate first the variances and covariances of $Y_a$, $Y_b$, $Y_c$, . . . from the
formulae given in Sections 9.2 and 9.3. The variances and covariances of the
proportions may then be obtained from the formulae

$$Y^2\,V(P_a) = V(Y_a) - 2P_a \operatorname{cov}(Y_a, Y) + P_a^2\,V(Y)$$

$$Y^2 \operatorname{cov}(P_a, P_b) = \operatorname{cov}(Y_a, Y_b) - P_b \operatorname{cov}(Y_a, Y) - P_a \operatorname{cov}(Y_b, Y) + P_a P_b\,V(Y)$$

with

$$\operatorname{cov}(Y_a, Y) = V(Y_a) + \operatorname{cov}(Y_a, Y_b) + \operatorname{cov}(Y_a, Y_c) + \ldots, \text{ etc.}$$
Example 9.4

Estimate the variances and covariances of the percentage of land in Hereford
attributable to the three types of farm of Example 9.3.

The percentages are

    100 $P_a$ = 23.40
    100 $P_b$ = 47.79
    100 $P_c$ = 28.80

From Table 9.3.c the variance and covariance matrix of $Y_a$, $Y_b$, $Y_c$ is

    10^4 ×    3394   - 2028   - 1257
            - 2028     6111   - 2627
            - 1257   - 2627     4528
            ------------------------
               109     1456      644      2209

The column totals give $\operatorname{cov}(Y_a, Y)$, etc., and the grand total gives $V(Y)$.
We thus have

$$100^2\,V(P_a) = \frac{(3394 - 2 \times 0.2340 \times 109 + 0.2340^2 \times 2209) \times 10^8}{397{,}730^2} = 2.190$$

$$100^2 \operatorname{cov}(P_a, P_b) = \frac{(-2028 - 0.4779 \times 109 - 0.2340 \times 1456 + 0.2340 \times 0.4779 \times 2209) \times 10^8}{397{,}730^2} = -1.374$$

The full variance and covariance matrix is

      2.190   - 1.374   - 0.816
    - 1.374     3.302   - 1.928
    - 0.816   - 1.928     2.744

with the check that each column total is zero.
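
The arithmetic of Example 9.4 can be checked with a short script. The following is a minimal Python sketch, assuming the rounded proportions and matrix entries printed in the example; the function name `proportion_covariances` and the variable names are illustrative and do not come from the original text.

```python
def proportion_covariances(cov_y, p, y_total):
    """Variances and covariances of domain proportions P_i = Y_i / Y.

    cov_y   : variance-covariance matrix of the domain totals Y_a, Y_b, ...
    p       : estimated proportions P_a, P_b, ...
    y_total : estimated population total Y
    """
    k = len(p)
    # cov(Y_i, Y) is the i-th row (or column) total of the symmetric matrix.
    cov_with_total = [sum(row) for row in cov_y]
    v_total = sum(cov_with_total)  # V(Y) is the grand total
    return [
        [
            (cov_y[i][j]
             - p[j] * cov_with_total[i]
             - p[i] * cov_with_total[j]
             + p[i] * p[j] * v_total) / y_total ** 2
            for j in range(k)
        ]
        for i in range(k)
    ]


# Matrix of Example 9.4, which is printed in units of 10^4.
cov_y = [[3394e4, -2028e4, -1257e4],
         [-2028e4, 6111e4, -2627e4],
         [-1257e4, -2627e4, 4528e4]]
p = [0.2340, 0.4779, 0.2880]
y_total = 397730

# Multiply by 100^2 to express the results for percentages.
cov_pct = [[1e4 * v for v in row]
           for row in proportion_covariances(cov_y, p, y_total)]
for row in cov_pct:
    print(" ".join(f"{v:7.3f}" for v in row))
```

Because the printed proportions are rounded (they sum to 0.9999), the column totals of the computed matrix are close to, but not exactly, zero.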

9.5 Relative precision of different methods of sampling when domains 
of study cut across strata 

The relative precision of various sampling methods when domains of study 
cut across strata can be studied by the methods already outlined in Chapter 8. 
All that is necessary in any particular case is to estimate and compare the 
expected sampling errors when different methods are used. 

From the results already given it will be apparent that with a uniform
sampling fraction the gain in accuracy which results from stratification is largely
lost for domains of study that cut across strata. It is therefore important
where practicable to use strata which correspond to the expected domains of
study. This, however, is not always possible, either because of the resulting
increase in complexity or because the information necessary for classification 
of the sampling units into appropriate domains of study is only obtained in the 
course of the survey. In the National Farm Survey described in Section 5.21, 
for example, it would have been impossible to stratify for all the various domains 
of study into which the data were subsequently broken down. Information 
on such items as type of occupancy was not known in advance (indeed, collection 
of this information was one of the objects of the survey) ; but even had it been 
available the number of different types of domain, all of which cut across one 
another, was so great that the number of sub-classes thereby created would 
have been far too numerous to be used as strata. 

When a variable sampling fraction is used the situation is somewhat different.
Although there is likely to be a large increase in variance when domains of
study cut across strata, there will still be substantial gains from the use of a
variable sampling fraction in place of a uniform sampling fraction. The
optimal values of the sampling fractions will, however, differ from those which
are optimal for the population estimates and will indeed depend on what
quantities (numbers, totals, means or proportions) require estimation. The
sampling will be optimal for population estimates of means or totals when the
sampling fractions are such that the values of $f_i \sqrt{c_i}$ are proportional to $\sigma_i$
(Section 8.17 (a)). For the estimation of a mean of a particular domain the
sampling will be approximately optimal when $f_i \sqrt{c_i}$ is proportional (apart
from errors of estimation) to the square root of $1/(n_i - 1)$ times the quantity
in curly brackets in the formula for $V(\bar{y}_a)$ of Section 9.3. Replacing $n_{ia} - 1$
by $n_{ia}$ and $n_i - 1$ by $n_i$ gives $f_i \sqrt{c_i}$ approximately proportional to the square
root of

$$p_{ia}\{q_{ia}(\bar{y}_{ia} - \bar{y}_a)^2 + s_{ia}^2\} \qquad (9.5.a)$$

For the corresponding total $f_i \sqrt{c_i}$ must similarly be approximately proportional
to the square root of

$$p_{ia}\{q_{ia}\bar{y}_{ia}^2 + s_{ia}^2\} \qquad (9.5.b)$$

The expression for a ratio is similar.

If the strata consist of size-groups of the variate y, or of a variate x highly
correlated with y, and if we take $f_i \sqrt{c_i}$ proportional to $\bar{y}_i$ or $\bar{x}_i$ for the
different size-groups, allocation will be about optimal for the estimation of the
totals of domains cutting across strata, and for the proportions $Y_a/Y$, especially
in those cases (which are of frequent occurrence) in which $s_{ia}$ is about
proportional to the $\bar{y}_i$ or $\bar{x}_i$ of the size-group. In the estimation of the means of
different domains, however, a greater proportion will require to be taken
from the extreme size-groups, small as well as large. For estimation of the
number of units in the different domains sampling will be optimal when the $f_i \sqrt{c_i}$
all have approximately the same value. The best balance between these
conflicting requirements will usually be attained by increasing somewhat the
sampling fractions for the size-groups with small y. This is in fact what was
done in the National Farm Survey.
Example 9.5

Using the data of Example 9.3, determine the relative precision of (a) the
sample of that example, (b) a stratified random sample with uniform sampling
fraction, and (c) a fully random sample, with regard to numbers of farms in
the different domains, and their mean and total acreages.

We will here outline the calculations for $V(Y_a)$. It is best, particularly if
relative efficiencies require to be evaluated, to introduce the simplification used
in arriving at expression 9.5.b. We then have

$$V(Y_a) = S\,(g_i - 1)\,N_{ia}\,(q_{ia}\bar{y}_{ia}^2 + s_{ia}^2) \qquad (9.5.c)$$

The values of $N_{ia}$ can be obtained from Table 9.3.a by applying the relevant
raising factors. The quantities $q_{ia}\bar{y}_{ia}^2 + s_{ia}^2$ can be calculated from Table
9.3.b. Estimates of the proportions $h_{ia}$, etc., of farms falling in the different
size-groups are also required for each domain. All these quantities are tabulated
in Table 9.5.a.

TABLE 9.5.a  VALUES OF $N_{ia}$, ETC.

    Size-group     $N_{ia}$     $h_{ia}$        $q_{ia}\bar{y}_{ia}^2 + s_{ia}^2$
        2            720     0.677966          2041.6
        3            316     0.297552         18346.9
        4             26     0.0244821       124205.4
                    ----     --------
                    1062     1.000000

For a stratified sample with uniform sampling fraction containing the same
total number of farms $g = 3296/614 = 5.36808$. $V(Y_a)$ is given by formula
9.5.c with all $g_i$ equal to $g$. Thus by summing the products of the second
and fourth columns of Table 9.5.a, and multiplying by $g - 1$, we obtain
$V(Y_a) = 4.36808 \times 10497000 = 45850000$.

For the random sample the estimated variance $s_a^2$ within domain A over all
size-groups is required. The method of Section 8.3 (c) must be followed.
The various terms in the formula for $s_a^2$ (formula for $s^2$ of Section 8.3.c) can
be determined from the values already given in Table 9.3.b, and the values
of $h_{ia}$ above. To avoid having to calculate $\bar{y}_{ia}$ and $\bar{y}_a$ to more decimal places
than are given in Table 9.3.b, $S\,h_{ia}\bar{y}_{ia}^2 - \bar{y}_a^2$ may be replaced by

$$S\,h_{ia}(\bar{y}_{ia} - y_0)^2 - (\bar{y}_a - y_0)^2$$

where $y_0$ is a working mean, say 100, near $\bar{y}_a$. This comes to 3920.50. We
also find $S\,h_{ia}s_{ia}^2 = 953.62$ and $S\,h_{ia}(1 - h_{ia})s_{ia}^2/n_{ia} = 11.52$. Note that here
the original $n_{ia}$ are to be used. Hence

$$s_a^2 = 953.62 + 3920.50 - 11.52 = 4862.60.$$

The estimated value of $q_a$ is $1 - 1062/3296 = 0.677791$. Hence

$$q_a\bar{y}_a^2 + s_a^2 = 10069.74,$$

and thus from formula 9.2.p

$$V(Y_a) = N_a\,(g - 1)\,(q_a\bar{y}_a^2 + s_a^2) = 1062 \times 4.36808 \times 10069.74 = 46710000.$$

For total acreage of domain A, therefore, the precision of the stratified
random sample with uniform sampling fraction relative to that with the variable
sampling fraction (Table 9.3.c) is 3394/4585 = 74.0 per cent. The precision
of a fully random sample relative to a stratified sample with uniform sampling
fraction is 4585/4671 = 98.2 per cent. There is, therefore, considerable gain
by the use of the variable sampling fraction but very little gain by stratification.
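
The comparison for domain A can be reproduced in a few lines. This is a minimal Python sketch, assuming the entries of Table 9.5.a, the value 10069.74 quoted in the text and the variance 3394 × 10^4 from Table 9.3.c; the variable names are illustrative only.

```python
# Table 9.5.a: (N_ia, q_ia * ybar_ia^2 + s_ia^2) for size-groups 2, 3 and 4.
table_9_5_a = [(720, 2041.6), (316, 18346.9), (26, 124205.4)]

g = 3296 / 614  # raising factor of the sample with uniform sampling fraction

# Stratified sample, uniform sampling fraction (formula 9.5.c, all g_i = g).
v_uniform = (g - 1) * sum(n * term for n, term in table_9_5_a)

# Fully random sample (formula 9.2.p), using the value 10069.74 from the text.
n_a = sum(n for n, _ in table_9_5_a)
v_random = n_a * (g - 1) * 10069.74

# Variable sampling fraction: V(Y_a) from Table 9.3.c, in the same units.
v_variable = 3394e4

precision_usf_vs_vsf = 100 * v_variable / v_uniform
precision_random_vs_usf = 100 * v_uniform / v_random
print(round(precision_usf_vs_vsf, 1), round(precision_random_vs_usf, 1))
```

Relative precision is taken here as the inverse ratio of the variances, which is how the percentages in the text are formed.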

The full results for number of farms, total acreage and mean acreage are
shown in Table 9.5.b. For number of farms the stratified sample with uniform
sampling fraction is on the average about 40 per cent more precise than the
sample with variable sampling fraction. For total acreage, on the other hand,
the sample with variable sampling fraction is about 50 per cent more precise
than the sample with uniform sampling fraction, the gain in precision being
greater for the separate domains than for the whole county.* For the mean
acreage the relative precision for the different domains is very variable. There
is a gain by use of a variable sampling fraction for domain A but not for domains
B and C. The random sample is always less precise than the stratified sample
with uniform sampling fraction, but the gain due to stratification is very variable
for the different measures and different domains. These results are, of course,
what would be expected from the nature of the variances.

TABLE 9.5.b  RELATIVE PRECISION (PER CENT.) OF DIFFERENT TYPES OF SAMPLE

                                        A      B      C     Whole county
    u.s.f./v.s.f.    No. of farms      152    136    132
                     Total acreage      74     65     62        84
                     Mean acreage       77    101     95        84

    Random/u.s.f.    No. of farms       95     98     99
                     Total acreage      98     71     88        21
                     Mean acreage       82     60     80        21

The relative efficiencies will be somewhat nearer unity than the relative
precisions. A simple method of calculating them is described in Section 10.12.

9.6 A further example of the critical analysis of survey data : factors at 
two levels 

The following example of a pilot investigation of the effects of various
factors on family size is due to N. Keyfitz (1953, C').† The investigation is of
general interest, as it shows how the simultaneous effects of a number of
quantitative factors can be studied by treating them as if they were qualitative
factors each at two levels. It also provides an illustration of the possibilities
of carrying out analyses of this kind on a small sample of census material,
analyses which would be quite intractable if the whole of the material were
included.

* This is less than the relative efficiency reported in Section 5.21 because the smallest
size-group has been omitted and there are no farms in the largest size-group in this
county.

† This was first presented at a lecture given by Dr. Keyfitz at the London School of
Economics. I am greatly indebted to him for providing me with details of the
investigation in advance of its appearance in published form. For a fuller discussion the
published paper should be consulted.

Table 9. 6. a shows the average number of children ever born per family 
and the numbers of families in a small sample from 16 counties in the Province 

TABLE 9.6.a  1941 CENSUS OF CANADA: AVERAGE NUMBERS OF CHILDREN
AND NUMBERS OF FAMILIES IN A SMALL SAMPLE CLASSIFIED IN SIX WAYS

    Present age                          45-54                        55-74
    Age at marriage                 15-19        20-24          15-19        20-24
    Years of schooling             0-6    7+    0-6    7+      0-6    7+    0-6    7+

                                          Average number of children
    Low income, French area:
      Far from city                9.4  10.7  10.3   9.8     10.1  14.5  10.4   9.8
      Near city                    7.4  12.9   8.3   6.7     10.0  11.0   7.6   8.6
    Low income, mixed area:
      Far from city               12.9  10.9   8.9   9.8      8.3  12.8   8.4   9.6
      Near city                    9.7  11.3   9.4   7.1      9.0   9.9   8.6   8.6
    High income, French area:
      Far from city               10.9  12.9  10.6   9.8     12.1  12.5   9.0  11.3
      Near city                    8.3   8.7   7.1  10.3     10.8  13.2  10.9   9.9
    High income, mixed area:
      Far from city               12.8  14.3   9.4  11.2     10.6  12.0   9.9   9.0
      Near city                   10.5  12.2   7.6   8.8     11.0  11.0   8.6   8.4

Number of families

15
5

14
3

35

9
15

14 
8 



11 

7 



29 
15 



10 



35 
10 



15 
14 



24 

7 



14 
25 



20 
37 



21 
49 



29 

28 



13 
12 



18 
9 



16 
12 



31 
14 



14 

14 



15 
18 



34 

15 



16 
17 



22 
14 



9 

26 



12 
22 



17 
29 



27 
30 



4 
10 
of Quebec. The data are taken from the census schedules of the 1941 census
of Canada. 1,056 Roman Catholic, French-speaking farming families of French
origin are included. The data are classified in a six-way classification according
to all combinations of

    Wife's present age:            45-54, 55-74
    Wife's age at marriage:        15-19, 20-24
    Wife's years of schooling:     0-6, 7+
    Farm income:                   Low, High
    Relation to city:              Far, Near
    Type of area:                  French, Mixed

The classifications for income and relation to city refer to counties, and not 
to individual families. (Data for incomes could not be obtained for separate 
families without excessive labour.) The classification for area refers to sub- 
districts. All six classifications refer to quantitative factors. They have been 
converted to qualitative factors, each at two levels, by grouping the data ; the 
groupings chosen exclude entirely extreme values of some of the factors. 

Schematically the data now correspond to the results of a factorial experi- 
ment with 6 factors each at two levels. They differ, however, from experimental 
data in that the number of families, and therefore the accuracy, varies from 
cell to cell. Moreover since the classifications do not represent imposed ex- 
perimental treatments the conclusions are subject to the qualifications set out 
in Section 5.23. In this case also there is the further qualification that the 
last three classifications may be affected by other factors which are common 
to counties or sub-districts. 

In order to estimate the average effect of each factor separately, freed from
the effects of the other factors, the method of weighted means of differences
of sub-class means described in Section 5.23 (3) can be used. For each factor
there are 32 pairs of values which differ in the factor in question and are the same
for all the other factors. For relation to city, for example, the first pair of cells
gives the difference (near - far) 7.4 - 9.4 = - 2.0, with weight 1/(1/15 + 1/5)
= 3.75. The weighted mean of all the 32 differences for this factor is found
to be - 1.28, with total weight 234.* A pooled estimate of the variance within
cells of the number of children per family can be calculated from the numbers
of children in the separate families (not reproduced here). The value obtained
is 18.15. The standard error of the above estimate is therefore
$\sqrt{18.15/234} = 0.28$.
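
The weighting rule can be written out as follows. This is a minimal Python sketch using only the figures quoted in the text (the first pair of cells, the total weight 234 and the pooled variance 18.15); the function names `pair_weight` and `weighted_mean_difference` are illustrative and not part of the original.

```python
import math


def pair_weight(*cell_sizes):
    """Weight of a comparison between cell means: the reciprocal of the
    sum of the reciprocals of the numbers of families in the cells."""
    return 1.0 / sum(1.0 / n for n in cell_sizes)


def weighted_mean_difference(pairs):
    """pairs: iterable of (mean_far, n_far, mean_near, n_near).
    Returns the weighted mean of (near - far) and the total weight."""
    total_w = 0.0
    total_wd = 0.0
    for mean_far, n_far, mean_near, n_near in pairs:
        w = pair_weight(n_far, n_near)
        total_w += w
        total_wd += w * (mean_near - mean_far)
    return total_wd / total_w, total_w


# First pair of cells: far 9.4 (15 families), near 7.4 (5 families).
first_weight = pair_weight(15, 5)

# Standard error from the pooled within-cell variance and the total weight.
std_error = math.sqrt(18.15 / 234)

print(round(first_weight, 2), round(std_error, 2))
```

Passing all 32 pairs of a factor to `weighted_mean_difference` would give the weighted mean and total weight for that factor; the same `pair_weight` rule extends to comparisons involving four cells.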

The effects of the other factors may be calculated in a similar manner.
The full results are shown in Table 9.6.b. A plus sign in each case indicates
that families at the higher (or indicated) level of the factor have the larger
number of children. It will be seen that four of the six factors show significant
effects.



* The form of these computations is set out in greater detail in Section 9 . 7. 
TABLE 9.6.b  EFFECTS OF THE SIX FACTORS

                                       Eliminating effects       Ignoring
                                       of other factors          other factors
    Present age                         + 0.38 ± 0.27            + 0.30 ± 0.28
    Age at marriage                     - 1.77 ± 0.28            - 2.02 ± 0.28
    Years of schooling                  + 0.72 ± 0.28            + 0.16 ± 0.28
    Income                              + 0.90 ± 0.28            + 1.20 ± 0.27
    Relation to city (near - far)       - 1.28 ± 0.28            - 1.58 ± 0.27
    Type of area (mixed - French)       - 0.15 ± 0.28            - 0.74 ± 0.28

The overall average effect of each factor, ignoring the other factors, is also 
shown in Table 9. 6. b for comparison. The greatest differences are in years of 
schooling and type of area. It should be noted that the estimates obtained by 
eliminating the effects of other factors do not necessarily give the best estimates 
of the total effects of the factors. If, for example, age at marriage tends to be 
increased by longer schooling the total effect of schooling may be small, although 
amongst women married at the same age those with longer schooling may be
more fertile. This question is discussed at greater length in Section 9.11. 

The above analysis is fully appropriate only when there are no marked 
differences (relative to the variability of the data) in the effects of each factor 
at different levels of the other factors. When there are such differences the 
effects are no longer additive and the factors are said to interact. When there 
are interactions then not only the estimates but the true values of the average 
effects will depend on the weights which are employed in arriving at the mean. 
It should be recognised that even in this case it is not incorrect to use the weights 
which give the most accurate estimates, but the quantities estimated are then 
in part determined by the actual (known) weights employed, which are them- 
selves dependent in part on the chance fluctuations of sampling which determine 
the numbers in the various cells. When the interactions are substantial, 
therefore, and comparisons between the estimates derived from different 
groups of the population are required, it may be better to obtain estimates 
using weights based on some standardised proportions in the different cells, 
since if this is not done the comparisons will contain components of interaction 
from which we may wish to free them. If the actual frequencies in the different 
cells are nearly proportionate, then weights based on proportionate frequencies 
may be taken as in Method 2 of Section 5.23. 

In general, interactions are only likely to be large if the factors concerned 
produce large average effects. Their existence and magnitude can be examined 
by the same method as that illustrated above for the average effects. To 
investigate the interaction of relation to city and schooling, for example, we 
take sets of four cells which differ in these two factors but have the same levels 
for all other factors. Each 2x2 group of cells in the table is such a set. For 
the top left-hand group we have

    Effect of schooling:
        Far from city              10.7 - 9.4 = + 1.3
        Near city                  12.9 - 7.4 = + 5.5

        Difference (near - far)                 + 4.2

The weight of this difference is 1/(1/15 + 1/14 + 1/5 + 1/8) = 2.16. There
are 16 such differences, and the weighted mean is found to be - 0.40 with
total weight 53.37, giving a standard error of 0.58. For the reason given below
it is customary to define the interaction as one-half the difference of the effects.
The estimate of the interaction is therefore - 0.20 ± 0.29.

In this example only two of the 15 two-factor interactions are substantially
larger than their standard errors. These are

    Age at marriage × schooling       - 0.67 ± 0.29
    Present age × type of area        - 0.73 ± 0.29

The second may reasonably be regarded as a chance effect since neither present
age nor type of area produces a significant average effect. The first, however,
is between two factors both of which produce significant average effects, and
may be taken to indicate that the effect of each factor is influenced by the
level of the other factor. For two factors a and b which have average effects
A and B and interaction A.B the effect of a at the lower level of b will be

$$A - A.B$$

and at the higher level will be

$$A + A.B$$

and similarly for b. Thus the effect of schooling at the lower age of marriage
will be

    + 0.72 - (- 0.67) = + 1.39

and at the higher age will be

    + 0.72 + (- 0.67) = + 0.05.

The reason for the factor ½ in the interactions will now be apparent; it makes
the average effects and interactions directly additive.

The results can also be put in the form of a 2 × 2 table. If M is an estimate
of the mean, the cells of the table will be

                         b at lower level              b at higher level
    a at lower level     M - ½A - ½B + ½A.B            M - ½A + ½B - ½A.B
    a at higher level    M + ½A - ½B - ½A.B            M + ½A + ½B + ½A.B

In this case, taking the unweighted mean, 10.13, of the cell values for M we
obtain

                                    Schooling
                                  0-6        7+
    Age at marriage    15-19     10.32     11.71
                       20-24      9.22      9.27

Without knowing more about the data it is difficult to offer an explanation of
this apparent difference in effect of schooling.
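
The construction of the 2 × 2 table from the mean, the two average effects and the interaction can be sketched as below. This is an illustrative Python fragment, assuming the estimates quoted in the text (M = 10.13, A = -1.77 for age at marriage, B = +0.72 for schooling, A.B = -0.67); the function name `two_by_two` is hypothetical.

```python
def two_by_two(mean, a, b, ab):
    """Expected cell values for two two-level factors with average effects
    a and b and interaction ab (defined as half the difference of effects).
    Rows: lower and higher level of the first factor;
    columns: lower and higher level of the second factor."""
    return [
        [mean - a / 2 - b / 2 + ab / 2, mean - a / 2 + b / 2 - ab / 2],
        [mean + a / 2 - b / 2 - ab / 2, mean + a / 2 + b / 2 + ab / 2],
    ]


# First factor: age at marriage; second factor: years of schooling.
table = two_by_two(mean=10.13, a=-1.77, b=0.72, ab=-0.67)
for row in table:
    print([round(v, 2) for v in row])

# Effect of schooling at each age of marriage: B - A.B and B + A.B.
schooling_at_lower_age = table[0][1] - table[0][0]
schooling_at_higher_age = table[1][1] - table[1][0]
```

The differences along each row recover the two schooling effects quoted earlier, which is what the factor ½ in the definition of the interaction is designed to ensure.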

It may be noted that if the interactions can be assumed to be negligible 
and there are more than two factors, the above method will not provide quite 
the most accurate estimates, even when all factors are at two levels only. An 
efficient process of estimation is then provided by an extension of the method 
of fitting constants, exemplified in Section 5.24. The gain in efficiency is, 
however, likely to be small unless the data are very fragmentary. In the present 
example the gains for the six factors ranged from 2 per cent to 12 per cent.
Such gains would not justify the additional computational labour, which is 
better devoted to extending the scope of the investigation in other directions. 

The procedure of grouping and working with factors at two levels provides 
an alternative to multiple regression analysis (Section 9.9). In data of this 
complexity, regression analysis would be exceedingly laborious, requiring the 
evaluation of 28 sums of squares and products and the inversion of a 6 X 6 
matrix. Moreover the regression technique does not readily lend itself to the 
investigation of the existence of interactions. 

It should be recognised, however, that the effects estimated by the above 
procedure do not give direct estimates of the regression coefficients (i.e. the 
change per unit change of the factor). Thus the difference between the two
age-at-marriage groups, 15-19 and 20-24, is - 1.77. We cannot assume that
this represents the change resulting from a change of marriage age from 17.5
to 22.5, since the marriages will not be evenly distributed within the groups;
there will be very few marriages, for example, at age 15. Instead an estimate
of the change per year can be made by dividing - 1.77 by the difference in
mean marriage age for the two age-at-marriage groups taken over the whole
of the data. A more accurate procedure is to calculate the difference d in mean
marriage age for each pair of cells which goes to make up the weighted mean
difference - 1.77, and take a weighted mean of these differences (using the same
weights as in the main calculation). This will provide the appropriate divisor 
for estimating the rate of change. If the relationship is really linear a more 
accurate estimate will be obtained by taking new weights equal to d times the 
old weights for both means, but the differences between the different d are 
not likely to be sufficiently large for this to be worth while. 

The efficiency of estimates of regression coefficients based on data grouped
into two classes is not unduly low. If all variates are normally distributed and
the divisions between the groups are located at the mid-point of each distribution
an efficiency of 64 per cent is attained when the regressions are truly linear.*
There will be some additional loss if the divisions are not taken at the mid-points.
In practice the effective efficiency is likely to be greater than normal theory
would indicate, since the occasions on which regressions are truly linear are
somewhat rare. Moreover gross errors in the independent variate produce
less disturbance in data grouped into two classes. When a considerable amount
of data is available, as in the present example, there is no doubt that the gains
from the more detailed analysis which is possible with the grouping method
will far outweigh the theoretical gains in efficiency of the regression method.

* For a single independent variate a greater efficiency will be attained if the central
portion of the distribution is rejected. The maximum efficiency with this procedure is
81 per cent when the central 46 per cent of the distribution is omitted. This procedure
is of no value with more than one independent variate, however, since the central values
of the different variates will not appertain to the same units.
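
The two efficiencies quoted (64 per cent and 81 per cent) can be reproduced numerically. The sketch below is in Python and rests on an assumption not stated in the text: the standard large-sample result that, for a normally distributed independent variate, a slope estimated from the means of the upper and lower fractions t of the data has efficiency 2φ(c)²/t relative to least squares, where c cuts off the upper fraction t and φ is the standard normal density. The function name `grouping_efficiency` is illustrative.

```python
from statistics import NormalDist


def grouping_efficiency(tail_fraction):
    """Large-sample efficiency, relative to least squares, of a slope
    estimated from the means of the upper and lower `tail_fraction` of a
    normally distributed independent variate (0 < tail_fraction <= 0.5)."""
    std_normal = NormalDist()
    cut = std_normal.inv_cdf(1.0 - tail_fraction)
    return 2.0 * std_normal.pdf(cut) ** 2 / tail_fraction


at_midpoint = grouping_efficiency(0.5)     # division at the mid-point
central_omitted = grouping_efficiency(0.27)  # central 46 per cent omitted
print(round(at_midpoint, 2), round(central_omitted, 2))
```

With the division at the mid-point the expression reduces to 2/π, which is the source of the figure of 64 per cent.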

9.7 Qualitative data : separation of the effects of different factors 

When the effects of a number of factors on a qualitative variate are under 
consideration the approach by means of regression and fitting constants requires 
modification. The difficulty arises from the fact that when proportions are 
analysed different factors cannot be expected to produce a strictly additive 
effect, since the proportions themselves can only lie in the range 0 to 1. The
effects may often be made more nearly additive by re-scaling the proportions,
using some transformation which gives a transformed range from $-\infty$ to $+\infty$.
The two transformations which are commonly used for this purpose are the
logit and probit transformations.

The logit transformation is given by the equation

$$z = \tfrac{1}{2}\log_e \frac{p}{q}$$

where $q = 1 - p$, and has been tabulated by Finney in Statistical Method in
Biological Assay and elsewhere, where 5 is conventionally added to the z value
to avoid negative values. A similar tabulation is given in Statistical Tables
(5th Edition) without the addition of 5. The logit transformation is equivalent
to the r, z transformation, which is mainly used for transforming correlation
coefficients, with $r = 2p - 1$.

The logit transformation has the property that for equal intervals on the
logit scale the odds (p : q) are changed by the same factor. Since
$\tfrac{1}{2}\log_e 2 = 0.34657$ a change of this amount in the logit represents a doubling
of the odds. Thus odds of 4 : 1 will be changed to 8 : 1, corresponding to a
change of p from 0.8 to 0.889. If the logit scale (without the addition of 5)
is altered to double logits (base 2) by multiplication by 2.885 (= 1/0.34657), 0 will
represent even odds, + 1 odds of 2 : 1, + 2 odds of 4 : 1, - 1 odds of 1 : 2,
etc., and a unit effect will correspond to a factor of 2 in the odds. This enables
the results to be presented in a form in which their meaning is easily understood.
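
These properties of the logit scale are easily verified. The following is a minimal Python sketch, assuming the definition z = ½ log_e(p/q) used in this section; the function names are illustrative.

```python
import math


def logit(p):
    """Logit z = (1/2) log_e(p / q), with q = 1 - p."""
    return 0.5 * math.log(p / (1.0 - p))


def double_logit_base2(p):
    """Logit rescaled so that one unit corresponds to a doubling of the odds."""
    return logit(p) / (0.5 * math.log(2.0))  # equals log2(p / q)


def inverse_logit(z):
    """Proportion p corresponding to a logit z."""
    return 1.0 / (1.0 + math.exp(-2.0 * z))


step = 0.5 * math.log(2.0)                  # change in z that doubles the odds
p_after = inverse_logit(logit(0.8) + step)  # odds of 4 : 1 become 8 : 1
print(round(step, 5), round(p_after, 3), round(double_logit_base2(0.8), 3))
```

On the double-logit scale a proportion of 0.8 (odds of 4 : 1) is represented by + 2, and a proportion of one-third (odds of 1 : 2) by - 1.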

The probit transformation is the normal deviate corresponding to the 
integral (taken from the left) of the normal curve with unit standard deviation 
(see Fig. 7.3). It is therefore given by the inverse of Table A2, taking P = 2p
and giving the deviate (probit) a negative sign when p < 0.5, and taking
P = 2 (1 - p) and giving the deviate (probit) a positive sign when p > 0.5.
To avoid negative values 5 is conventionally added to the probit so obtained.
Thus for a probit of 4.5, P = 0.6171 and p = 0.3086. Full tables are given
in Statistical Tables and by Finney (loc. cit.).
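
Where tables are not to hand the probit can be computed directly. This is a minimal Python sketch using the standard library's `statistics.NormalDist`; the function names `probit` and `proportion_from_probit` are illustrative, and the conventional addition of 5 is made optional.

```python
from statistics import NormalDist


def probit(p, add_five=True):
    """Normal deviate cutting off a proportion p to its left, with 5 added
    by convention to avoid negative values."""
    deviate = NormalDist().inv_cdf(p)
    return deviate + 5.0 if add_five else deviate


def proportion_from_probit(value, add_five=True):
    """Inverse transformation: the proportion p for a given probit."""
    deviate = value - 5.0 if add_five else value
    return NormalDist().cdf(deviate)


p = proportion_from_probit(4.5)
two_tail = 2 * p  # the two-tail probability P when p < 0.5
print(round(p, 4), round(two_tail, 4))
```

The value of p agrees with the figure in the text to within rounding of the tabulated two-tail probability.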
There is no reason to expect that the effects of factors will be exactly additive
in the transformed data. As in quantitative data there may be interactions,
which will indicate departures from the additive law. Moreover if the effects
are additive for one transformation they will not be exactly additive for any
other. In extensive work it may under certain circumstances be worth
investigating what transformation produces the closest approach to additivity.
The logit and probit transformations, however, are the type of transformation
which may under many conditions be expected to give approximately additive
effects. For any but the most precise data they are sufficiently similar for it
to be immaterial which is used.

As an example of the type of analysis involved we may take the data given 
by Lombard and Doering (1947, A') obtained in the course of a survey on cancer 
knowledge. (See also Dyke and Patterson (1952, A'), where the data are 
analysed by means of logits, using a somewhat different method from that given 
here.) The object of the survey was to determine the influence of various 
possible sources of information on the knowledge of cancer. The individuals 
included in the survey were classified according to whether they read newspapers, 
listened to radio, etc., and also according to whether their knowledge of cancer 
was good or poor. The data so obtained are shown in the second and third 
columns of Table 9 . 7 . a, where the four factors are represented by 

a (newspaper reading) 

b (radio) 

c (solid reading) 

d (lectures). 

The corresponding logit or z values are shown in column 4. (These were 
taken from Dyke and Patterson's paper and are based on values of p to two 
decimal places.) 

TABLE 9.7.a  LOMBARD AND DOERING'S DATA ON CANCER KNOWLEDGE

                 n        p         z       Weight    Variance
    -           477     0.176    - 0.78      277      0.0036
    a           231     0.325    - 0.37      202      0.0050
    b            63     0.206    - 0.68       41      0.0244
    ab           94     0.372    - 0.27       88      0.0114
    c           150     0.447    - 0.11      148      0.0068
    ac          378     0.532    + 0.06      376      0.0027
    bc           32     0.500      0.00       32      0.0312
    abc         169     0.604    + 0.21      162      0.0062
    d            12     0.167    - 0.81        7      0.1429
    ad           13     0.538    + 0.08       13      0.0769
    bd            7     0.571    + 0.14        7      0.1429
    abd          12     0.667    + 0.34       11      0.0909
    cd           11     0.273    - 0.48        9      0.1111
    acd          45     0.600    + 0.20       43      0.0233
    bcd           4     0.250    - 0.55        3      0.3333
    abcd         31     0.742    + 0.52       24      0.0417
The method of Section 9.6 for evaluating the effects of the separate factors
can now be followed. Instead of a variance proportional to 1/n, however, a
z-value based on n individuals will have a variance of 1/(4pqn). These variances,
calculated from the observed p's, are shown in the last column. The effects of the
various factors (and, if required, their interactions) can now be estimated from
the weighted mean of the individual differences. The calculations for A are
shown in Table 9.7.b. The first two lines, for instance, give a difference of
- 0.37 - (- 0.78) = + 0.41 with variance 0.0050 + 0.0036 = 0.0086. The
weighted mean, + 0.325, has an estimated standard error of $\sqrt{1/295}$.

TABLE 9.7.b  CALCULATION OF THE A EFFECT

                        z        Variance    Weight
    a                + 0.41       0.0086       116
    ab - b           + 0.41       0.0358        28
    ac - c           + 0.17       0.0095       105
    abc - bc         + 0.21       0.0374        27
    ad - d           + 0.89       0.2198         5
    abd - bd         + 0.20       0.2338         4
    acd - cd         + 0.68       0.1344         7
    abcd - bcd       + 1.07       0.3750         3

                     + 0.325      1.0543       295
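
The calculation of Table 9.7.b, and the corresponding calculations for the other three factors, can be sketched as follows. This is a minimal Python illustration, assuming the z values and variances printed in Table 9.7.a; the dictionary layout and the function name `factor_effect` are not from the original. Weights are taken as exact reciprocals of the variances rather than rounded to whole numbers, so the results may differ from the printed ones in the third decimal place.

```python
# z values and variances 1/(4npq) from Table 9.7.a, keyed by factor
# combination ("" is the nil combination).
cells = {
    "": (-0.78, 0.0036), "a": (-0.37, 0.0050), "b": (-0.68, 0.0244),
    "ab": (-0.27, 0.0114), "c": (-0.11, 0.0068), "ac": (0.06, 0.0027),
    "bc": (0.00, 0.0312), "abc": (0.21, 0.0062), "d": (-0.81, 0.1429),
    "ad": (0.08, 0.0769), "bd": (0.14, 0.1429), "abd": (0.34, 0.0909),
    "cd": (-0.48, 0.1111), "acd": (0.20, 0.0233), "bcd": (-0.55, 0.3333),
    "abcd": (0.52, 0.0417),
}


def factor_effect(factor):
    """Weighted mean of the differences (with factor) - (without factor),
    each weighted by the reciprocal of its variance, and its standard error."""
    total_w = 0.0
    total_wd = 0.0
    for combo, (z_without, var_without) in cells.items():
        if factor in combo:
            continue
        with_factor = "".join(sorted(combo + factor))
        z_with, var_with = cells[with_factor]
        w = 1.0 / (var_with + var_without)
        total_w += w
        total_wd += w * (z_with - z_without)
    return total_wd / total_w, (1.0 / total_w) ** 0.5


results = {f: factor_effect(f) for f in "abcd"}
for f, (effect, se) in results.items():
    print(f.upper(), round(effect, 3), round(se, 3))
```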

The values obtained are:

            z-values             Double logits (base 2)
    A    + 0.325 ± 0.058         + 0.938 ± 0.167
    B    + 0.147 ± 0.062         + 0.424 ± 0.179
    C    + 0.500 ± 0.055         + 1.442 ± 0.159
    D    + 0.223 ± 0.098         + 0.643 ± 0.283

The effects of all four factors are significant. Their relative magnitude can
be approximately examined by means of the standard errors shown. These
indicate, for example, that solid reading (c) produces a larger effect than any
other factor. The estimates are, however, not independent, owing to the
inequality in the numbers in the different sub-classes. An exact test of the
difference in effect of a and c, for example, can be obtained by testing the
weighted mean of the differences c - a over all combinations of the other
factors. The calculations are similar to those shown in Table 9.7.b.

In addition to estimates of the effects an estimate of the mean z may be
required. This can be obtained from the unweighted mean of all the z's.
If the numbers in some of the cells are small, however, it is better to take the
weighted mean of the z's (with weights 4npq). We here have S(wz)/S(w) =
−286.9/1443. This weighted mean requires adjustment if it is to represent the
mean of a population with equal numbers in all cells. If $S_{a+}(w)$ is the sum of
the weights of the cells with a, and $S_{a-}(w)$ is the sum for the cells without a,
the correction to S(wz) due to the a effect is $-\tfrac12 A \{S_{a+}(w) - S_{a-}(w)\}$.
Here we have $S_{a+}(w)$ = 919 and $S_{a-}(w)$ = 524. The correction is therefore
−½ × 0.325 (919 − 524) = −64.2. The corrections for the b, c and d effects
are calculated similarly. Hence the corrected value of the mean is

$$ m = \{-286.9 - 64.2 + 52.0 - 37.8 + 134.8\}/1443 = -0.140 $$

The expected value of z in any cell can now be calculated. That for abcd, for
example, is

$$ m + \tfrac12 A + \tfrac12 B + \tfrac12 C + \tfrac12 D = -0.140 + 0.162 + 0.074 + 0.250 + 0.112 = +0.458 $$

corresponding to a value of p of 0.714. For the nil combination p = 0.186.
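
The conversion from the fitted z back to p can be sketched as follows. The sketch assumes the logit is defined as z = ½ log_e(p/q), which reproduces the two values of p quoted above, and that the corrected sum of S(wz) is negative, as the arithmetic requires; the function names are illustrative.

```python
import math

# Corrected mean of z: S(wz) plus the corrections for the a, b, c and d effects.
m = (-286.9 - 64.2 + 52.0 - 37.8 + 134.8) / 1443

# Estimated effects on the z scale.
effects = {"A": 0.325, "B": 0.147, "C": 0.500, "D": 0.223}


def expected_z(present):
    """Expected z for a cell; `present` holds the factors at the upper level."""
    return m + sum((0.5 if name in present else -0.5) * value
                   for name, value in effects.items())


def z_to_p(z):
    """Invert z = 0.5 * ln(p / (1 - p)) (assumed definition of the logit)."""
    return 1.0 / (1.0 + math.exp(-2.0 * z))


p_abcd = z_to_p(expected_z({"A", "B", "C", "D"}))
p_nil = z_to_p(expected_z(set()))
print(round(m, 3), round(p_abcd, 3), round(p_nil, 3))
```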

The above calculation is approximate, since the variances and weights 
have been estimated from the observed values of p. If any of the p's are 0 or 1
the process breaks down entirely, since z is infinite and its variance is also infinite.
If the 0 or 1 values arise from cells with small n they can be rejected (i.e. given
zero weight) without serious error. If, however, there are cells with large n
which have values of p equal to or very near 0 or 1 a further adjustment will
be needed. The procedure is similar to that which has become familiar in 
probit analysis. The necessary tables for both logits and probits, and also for 
the angular and log-log transformations, are given in Statistical Tables (5th 
Edition), together with examples. More details will be found in Finney's 
Statistical Method in Biological Assay. 

The important point to notice about the above method of analysis is that 
it provides quantitative estimates of the effects of the different factors. It 
therefore differs radically from the classical approach to the analysis of qualitative 
data by means of χ². The χ² analysis provides tests of significance, but not
estimates of the magnitude of the effects. The application of χ² to multiple
contingency tables of which the above is an example is, moreover, very 
complicated. 

9.8 Use of ratios and regressions in investigational work

Ratios and regressions have many uses in investigational work. Whenever 
the nature of a pair of variates is such that the ratio between them may be 
expected to be relatively constant then the replacement of the variates by some 
estimate of the ratio is likely to simplify considerably the interpretation of the 
data. The choice of estimate depends on the nature of the variability. If x 
and y do not vary very widely then the unweighted means of the individual 
ratios r(=y/x) calculated from the pairs of values are likely to give the most 
accurate results. Under such circumstances the biases of the unweighted means 
are usually of relatively little importance in comparative work. Once the 
individual ratios have been calculated the unweighted means are less trouble 
to compute than any form of weighted mean, particularly when many alternative 
groupings of the data are required. If on the other hand x and y vary widely
(and particularly if there are some very small values) the estimates r = S(y)/S(x)
may be subject to smaller errors, as well as being unbiased.
Several examples where ratios have proved of use have already been given.
In Example 6.19 both types of estimate were examined.

It is important to recognise that the use of a ratio does not imply that its 
value is necessarily constant over the range of x. If it appears desirable the 
relation between r and x (or r and y) can be examined, e.g. by the use of a 
regression of r on x (or r on y). It is sometimes objected that this procedure 
is inadmissible since random errors in x will give rise to a negative correlation 
between r and x. This objection is, however, only valid if x and y are in fact 
subject to random errors which are independent. In a study of the relation 
of earnings to factory size in an industry, for example, the size, as represented 
by the number of workers, will usually be correctly ascertained, and the relation 
between the average earnings per worker and size will therefore not have any 
spurious component of correlation ; by taking earnings per worker and size 
instead of total wage-bill and size as the variates for analysis we obtain the 
data in a form which is considerably easier to study. 

When the value of the ratio changes considerably over the range of x it 
may be more appropriate to use the regression of y on x. Either a linear or a 
curved regression may be used. The calculation of a linear regression has been 
described in Section 6.12. 

In investigational work we may be interested not only in the relation of 
y to x over the whole population, but also in differences in this relationship 
for different domains of study. In such cases ratios or regression lines can be 
calculated for each domain separately. For comparative purposes slight 
departure from the assumed law is often of little consequence. Thus provided 
the mean of x is similar for the different domains linear regressions may be used 
when some degree of curvature in the lines is apparent from the data. If, 
however, the means of x differ considerably the slopes of the linear regressions
will differ, although the whole of the data may in fact be adequately described 
by a single curved regression line. 

A full discussion of the use and interpretation of regression analysis is not 
possible here. One or two points may, however, be mentioned. 

If regressions are calculated for different domains of study which are, for 
example, parts of a random sample, or strata of a stratified sample, we shall 
obtain regression equations of the form : 

Domain A : $y_l = \bar{y}_a + b_a (x - \bar{x}_a)$
Domain B : $y_l = \bar{y}_b + b_b (x - \bar{x}_b)$

etc. To enable these equations to be compared directly $\bar{x}_a$, $\bar{x}_b$, etc., must be
replaced by some standard value $x_0$ which should be chosen conveniently near
the general mean. The equations will then become

$y_l = y_{a0} + b_a (x - x_0)$
$y_l = y_{b0} + b_b (x - x_0)$

etc., with $y_{a0} = \bar{y}_a + b_a (x_0 - \bar{x}_a)$, etc. The formula for the error of $y_{a0}$, etc.,
has been given in Section 7.12. The differences between the standardised
values $y_{a0}$, etc., will represent the differences between the domain y's for the
standard value $x_0$ of x. If $b_a$, $b_b$, etc., differ substantially, the differences
between $y_{a0}$, $y_{b0}$, etc., will depend on the value chosen for $x_0$. If there is no
evidence of differences between the b's then a mean b can be taken (in which 
case all the regression lines will be parallel). The mean b can be calculated 
from the formula of Section 6.13. The question of whether the b's differ can
be examined by means of the standard errors of $b_a$, $b_b$, etc., calculated from
the formula of Section 7.12. The analysis of variance can also be used in this
connection, but the procedure is rather complicated for those not familiar 
with the technique. (See, for example, Quenouille, Associated Measurements.) 
A common tactical error made by those unfamiliar with regression work is
to take $x_0$ as zero, and re-write the regression equations in the apparently simpler
form

$y_l = a_a + b_a x$
$y_l = a_b + b_b x$

etc. Unless the means of x are near to zero these equations are unsuitable
for comparison, for although $a_a$, $a_b$, etc., give the estimated values of y for
x = 0 they are subject to large errors, both because of errors in $b_a$, $b_b$, etc.,
and because although the assumption that the regressions are linear may be
reasonably correct over the range of x actually covered, it may be by no means
correct if the range is extended to zero.
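
The effect of the choice of standard value can be seen in a short sketch. The two domains below, with their means and slopes, are invented purely for illustration and are not taken from any survey; the function name is likewise hypothetical.

```python
def standardised_value(y_mean, b, x_mean, x0):
    """Value of the domain regression line at the standard value x0."""
    return y_mean + b * (x0 - x_mean)


# Hypothetical domains: (mean of y, regression coefficient, mean of x).
domains = {"A": (12.0, 0.50, 20.0), "B": (15.0, 0.52, 26.0)}

x0 = 23.0  # standard value chosen near the general mean of x

at_x0 = {name: standardised_value(ym, b, xm, x0)
         for name, (ym, b, xm) in domains.items()}
at_zero = {name: standardised_value(ym, b, xm, 0.0)
           for name, (ym, b, xm) in domains.items()}

print(at_x0)    # comparison within the observed range of x
print(at_zero)  # intercepts: an extrapolation far outside the observed range
```

With these invented figures the values at x0 depend only slightly on the slopes, whereas each intercept moves by the mean of x (20 or 26 units) times any error in the slope.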

When comparing regression lines it is often useful to plot all the lines on 
the same graph. Relations which are implicit in the equations will then be 
immediately apparent. 

Regressions can easily be calculated from grouped data. If the data are 
grouped for x only and the mean value of y is calculated for each group, the 
plot of these values (apart from sampling errors and a small error introduced 
by the grouping) represents the regression of y on x. With a large sample 
this is quite a good way of examining the regression. At the same time the plot 
will reveal whether the regression is truly linear. It is often advisable, however, 
to make an exact calculation of the regression from the group means, rather than 
to draw a line by eye, since it is difficult otherwise to make proper allowance 
for the varying numbers of observations in the different groups. If $n_1, n_2, \ldots$
represent the numbers in the groups, $x_1, x_2, \ldots$ the values of x for the
mid-points of the groups, and $\bar{y}_1, \bar{y}_2, \ldots$ the means of y, the formula for the
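regression coefficient will be

$$ b = \frac{S(n_r x_r \bar{y}_r) - n \bar{x} \bar{y}}{S(n_r x_r^2) - n \bar{x}^2} $$

where $n = S(n_r)$, $\bar{x} = S(n_r x_r)/n$, $\bar{y} = S(n_r \bar{y}_r)/n$.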
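
A regression from group means, weighted by the numbers in the groups, can be sketched as follows. The grouped figures are hypothetical and have been chosen to lie exactly on a straight line so that the result is easy to check; the function name is illustrative.

```python
def grouped_regression(n, x, y_mean):
    """Regression of y on x from grouped data.

    n      : numbers of observations in the groups
    x      : mid-point values of x for the groups
    y_mean : mean of y in each group
    Returns (b, x_bar, y_bar), each group being weighted by its number.
    """
    total = sum(n)
    x_bar = sum(ni * xi for ni, xi in zip(n, x)) / total
    y_bar = sum(ni * yi for ni, yi in zip(n, y_mean)) / total
    sxy = sum(ni * (xi - x_bar) * (yi - y_bar)
              for ni, xi, yi in zip(n, x, y_mean))
    sxx = sum(ni * (xi - x_bar) ** 2 for ni, xi in zip(n, x))
    return sxy / sxx, x_bar, y_bar


# Hypothetical grouped data lying exactly on y = 2 + 0.2 x.
numbers = [10, 30, 40, 20]
midpoints = [5.0, 15.0, 25.0, 35.0]
group_means = [3.0, 5.0, 7.0, 9.0]

b, x_bar, y_bar = grouped_regression(numbers, midpoints, group_means)
print(b, x_bar, y_bar)
```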

With two variates there are two regression lines, that of y on x and that of 
x on y. The data themselves will not tell us which, if either, of these regression
lines is appropriate for our purposes. If we wish to estimate the values of y 
for individual units for which only the value of x is known, or the mean y 
for a set of such units, then the regression of y on x will give the appropriate
estimation equation, provided the units can be regarded as further random samples 
of one unit each from the population of which the data are a sample. This is the 
justification for using the regression method in the manner outlined in Chapter 6 
for improving the accuracy of a sample for which supplementary information 
is available either from the whole population or from the first phase of a
two-phase sample. For estimation of this type, errors in x can be ignored in
calculating the regression. If x is subject to large errors the numerical value of
the regression coefficient will be reduced. We shall consequently estimate 
the unobserved y to be closer to the mean of the already observed y than 
would be the case if x were free from error. That this is as it should be is 
obvious if we consider the limiting case where x is subject to such large errors 
as to be worthless. In this case the mean value of the observed y provides 
the best estimate of all unobserved y. 

If, on the other hand, we are concerned with the estimation of the mean 
of the y's of a new sample which, although possessing certain features in common
with the original sample on which both x and y were measured, cannot be 
regarded as a random sample from the same population, the whole situation 
is altered. We then have to consider in much more detail the nature of the 
measured variates and the errors affecting them. This situation is discussed 
in more detail in Section 9.10. 

If the regressions are being used in investigational work we shall not usually 
be concerned with problems of the estimation of further values of y from 
observed values of x. We shall rather be concerned to evaluate the underlying 
laws which govern the relationships between various variates. In this case 
if y is believed to depend causally, in part at least, on x we shall normally 
require the regression of y on x. This will give the relation between x and the 
mean value of y for given x. The variation of the actual y's about the mean
value of the y's for a given x may then be attributed to the influence of other
variates on y and to random errors in y (of observation, etc.). If x is subject
to error a correction to the regression coefficient will in this case be required,
as is explained in Section 9.10.

9.9 Multiple regression 

Instead of taking a regression on a single variate it is possible to take a 
multiple regression on two or more variates. The formulae will be given for 
two variates ; they can easily be extended to more variates when required. 

To shorten the formulae we may write $S_{11}$ for $S(x_1 - \bar{x}_1)^2$, $S_{12}$ for
$S(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)$, $S_{1y}$ for $S(x_1 - \bar{x}_1)(y - \bar{y})$, etc. If $x_1$ and $x_2$ are the
two independent variates the regression can be written in the form

$$ y_l = \bar{y} + b_1 (x_1 - \bar{x}_1) + b_2 (x_2 - \bar{x}_2) \qquad (9.9.a) $$

The regression coefficients $b_1$ and $b_2$ are given by the solution of the two
simultaneous linear equations (known as the normal equations) :

$$ b_1 S_{11} + b_2 S_{12} = S_{1y} $$
$$ b_1 S_{12} + b_2 S_{22} = S_{2y} \qquad (9.9.b) $$

In order to obtain the standard errors of $b_1$ and $b_2$, or any linear function of
$b_1$ and $b_2$, it is necessary to perform an operation known as inverting the matrix
given by the coefficients of the b's in the above equations. This is equivalent
to solving the two pairs of equations of which the left-hand sides are those of
the normal equations, but which have as numerical terms 1, 0 and 0, 1
respectively, instead of $S_{1y}$, $S_{2y}$. The values of $b_1$ and $b_2$ given by these equations are
commonly denoted by $c_{11}$, $c_{12}$, and $c_{21}$, $c_{22}$. The c's form the inverse matrix,
which has diagonal symmetry, so that $c_{12} = c_{21}$. This property considerably
reduces the labour of inverting a large matrix. Methods of performing the 
inversion expeditiously when there are a number of variates are described in 
many statistical textbooks (e.g. Statistical Methods for Research Workers) and 
will not be given here. Whatever the method followed the values obtained 
should be substituted in the original equations to see that they are really 
satisfied. 

When the c's have been obtained the b's can be calculated from the formulae

$$ b_1 = c_{11} S_{1y} + c_{12} S_{2y} $$
$$ b_2 = c_{12} S_{1y} + c_{22} S_{2y} \qquad (9.9.c) $$

The sum of squares of the deviations from the regression line will be

$$ Q = S(y - y_l)^2 = S_{yy} - b_1 S_{1y} - b_2 S_{2y} \qquad (9.9.d) $$

Since two degrees of freedom have been absorbed by the regression line, and
one degree of freedom because deviations from the mean have been taken, the
total number of degrees of freedom remaining will be n − 3. The residual
variance of a single observation after fitting the regression is therefore

$$ s_r^2 = Q/(n - 3) $$

We then have

$$ V(b_1) = c_{11} s_r^2 \qquad V(b_2) = c_{22} s_r^2 \qquad \mathrm{cov}(b_1, b_2) = c_{12} s_r^2 $$

Hence the variance of any linear function is given by

$$ V(l_1 b_1 + l_2 b_2) = (l_1^2 c_{11} + 2 l_1 l_2 c_{12} + l_2^2 c_{22}) s_r^2 \qquad (9.9.e) $$

The reader should verify that if $x_2$ is omitted from the above formulae the
formulae already given in Sections 6.12 and 7.12 for a regression on a single 
variate are obtained. 
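
The whole calculation for two independent variates can be sketched in a few lines. The data are invented for illustration, the function name and the returned dictionary keys are hypothetical, and the 2 × 2 matrix is inverted directly rather than by any of the desk methods referred to above.

```python
def fit_two_variates(x1, x2, y):
    """Multiple regression of y on x1 and x2 by way of the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    S11 = sum(a * a for a in d1)
    S22 = sum(a * a for a in d2)
    S12 = sum(a * b for a, b in zip(d1, d2))
    S1y = sum(a * b for a, b in zip(d1, dy))
    S2y = sum(a * b for a, b in zip(d2, dy))
    Syy = sum(a * a for a in dy)
    # Inverse of the matrix of coefficients of the normal equations.
    det = S11 * S22 - S12 * S12
    c11, c12, c22 = S22 / det, -S12 / det, S11 / det
    b1 = c11 * S1y + c12 * S2y
    b2 = c12 * S1y + c22 * S2y
    Q = Syy - b1 * S1y - b2 * S2y      # sum of squares of deviations
    s2 = Q / (n - 3)                   # residual variance
    return {"b1": b1, "b2": b2, "S11": S11, "S12": S12, "S22": S22,
            "S1y": S1y, "S2y": S2y, "residual_variance": s2,
            "var_b1": c11 * s2, "var_b2": c22 * s2, "cov_b1_b2": c12 * s2}


# Hypothetical observations.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [9.1, 7.9, 19.2, 17.8, 29.0]

fit = fit_two_variates(x1, x2, y)

# Substitute back in the normal equations to see that they are satisfied.
check_1 = fit["b1"] * fit["S11"] + fit["b2"] * fit["S12"] - fit["S1y"]
check_2 = fit["b1"] * fit["S12"] + fit["b2"] * fit["S22"] - fit["S2y"]
print(fit["b1"], fit["b2"], check_1, check_2)
```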

The above formulae can be adapted to the fitting of curvilinear regression 
lines. If, for example, we require to fit a quadratic 

$$ y = a + bx + cx^2 $$

a multiple regression on x and $x^2$ can be calculated. Since x and $x^2$ are highly
correlated if all x are of the same sign it is better for purposes of computation
to take $(x - x_0)^2$ for the second variate, $x_0$ being a convenient value of x near $\bar{x}$.
If the values of x are equally spaced, curvilinear regressions can be
conveniently fitted by the use of orthogonal polynomials. The necessary tables
and instructions for their use will be found in Statistical Tables. Their main
use in survey work is in the analysis of data from surveys carried out on successive
occasions, with time as the independent variate.

9.10 Regression : effect of random errors in the variates 

Random errors in the dependent variate make the regression coefficients 
less precise, but no consistent error is introduced : as the sample size is increased 
the coefficients tend to the underlying population values. If, on the other hand, 
there are random errors in the independent variates the estimated coefficients, 
regarded as coefficients of an underlying regression law, are subject to consistent 
errors which do not decrease as the size of the sample is increased. In the case 
of a single independent variate with errors of x uncorrelated with those of y,
a consistent estimate b' of the coefficient β of the underlying regression law is

$$ b' = b/(1 - h) \qquad (9.10.a) $$

where h is the ratio of the error variance of x to the total variance of x (including 
the error variance). Thus the regression coefficient b calculated in the ordinary 
manner is on the average too small in absolute magnitude, and is said to be 
attenuated. 

In the case of a multiple regression with independent errors in the different
x's the estimation of the coefficients of the underlying regression law will be
obtained by replacing $S_{11}$, $S_{22}$, etc., in the normal equations by $S_{11}(1 - h_1)$,
$S_{22}(1 - h_2)$, etc. If the errors of the x's are correlated similar corrections are
required for $S_{12}$, etc. If the errors of the x's are correlated with those of y
the terms $S_{1y}$, etc., must also be corrected.

These adjustments can only be made if the relevant error variances and 
covariances can be estimated. If they arise solely from sampling errors this
will often be possible. Unfortunately errors in the independent variates are 
not confined to sampling errors. Errors of observation and measurement, and 
failure of the chosen measures to represent what is really required, will produce 
similar disturbances. Errors of observation and measurement frequently 
require supplementary investigations if they are to be assessed, though in some 
cases, as in the results described below, they will be included in the sampling 
errors as ordinarily calculated. Failure of the chosen measures to represent 
what is really required is much more troublesome and the amount of the 
disturbance cannot ordinarily be assessed. (An example where this point may 
arise is considered in the next section.) 

An example of a case in which adjustments for attenuation were necessary 
is provided by a survey recently carried out in England and Wales on potatoes. 
In this survey the yields were estimated by taking a sample of about 35 fields 
per county and lifting and weighing small sample lengths of row from the selected 
fields (Boyd and Dyke, 1950, H / ). The means of the sample yields and the official 
estimates (tons per acre) for the surveyed counties for the three years are shown 
in Table 9.10, and Fig. 9.10 shows the Ministry of Agriculture's estimates 
of the yields of the surveyed counties plotted against the sample estimates. 
The top broken line is the line on which all points should fall if there were no
errors in either the official or the sample estimates. The sample estimates may
be taken to be virtually free from bias, but they are subject to random sampling
errors owing to selection of fields within a county, and to the sampling of the
selected fields.

TABLE 9.10 YIELDS (TONS PER ACRE) AND REGRESSION COEFFICIENTS IN THE
POTATO SURVEY

        Sample        Official         Regression coefficients
Year    yields (x)    estimates (y)    unadjusted    adjusted

1948    9.35          7.69             0.365         0.457
1949    7.52          6.30             0.520         0.606
1950    9.48          7.59             0.415         0.530

FIG. 9.10 POTATO SURVEY : THE RELATION BETWEEN OFFICIAL ESTIMATES AND
SAMPLE YIELDS OF COUNTIES FOR 1948, 1949 AND 1950 (TONS PER ACRE)

It appears from the figure that there is a tendency to underestimate counties 
with high yields. This tendency can be evaluated quantitatively by calculating 
the regressions of the official estimates on the sample estimates. The thin
lines on the diagram give these regressions for the three years. The coefficients
are given in Table 9.10. Since, however, the sample estimates are subject
to random errors these regressions require adjustment if they are to represent
the average values of the official estimates for given values of the true county
yields.

The thick lines on the diagram and the coefficients in Table 9.10 give these
adjusted regressions. Apart from random errors of estimation they represent
the lines that would have been obtained if the yields of all the fields in each
county had been determined without error. The calculated regression line for
1948, for example, was

$$ y = 7.69 + 0.365 (x - 9.35) \qquad (9.10.b) $$

The average error variance per county was 0.398. This includes the first-stage
component due to selection of farms, the second-stage component due
to the selection (where necessary) of fields on the selected farms, and the
third-stage component due to the sampling of the selected fields. It is calculated from
the within-counties variance of the mean sample yield per farm. The total
variance of the county sample estimates was 1.968. Hence h = 0.398/1.968 =
0.202, and b' = 0.365/(1 − 0.202) = 0.457. The adjusted regression line is
therefore

$$ y = 7.69 + 0.457 (x - 9.35) \qquad (9.10.c) $$
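
The adjustment for attenuation applied to the 1948 coefficient can be reproduced with a very short sketch. The figures are those quoted in the text; the function name is illustrative only.

```python
def adjust_for_attenuation(b, error_variance, total_variance):
    """Return (b', h), where h is the ratio of the error variance of x to the
    total variance of x (including the error variance) and b' = b / (1 - h)."""
    h = error_variance / total_variance
    return b / (1.0 - h), h


# 1948 potato survey: unadjusted coefficient, average error variance per
# county, and total variance of the county sample estimates.
b_adjusted, h = adjust_for_attenuation(0.365, 0.398, 1.968)
print(round(h, 3), round(b_adjusted, 3))
```

Only the slope is altered; the adjusted line is still taken through the point given by the two means.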

It will be seen that part, but by no means the whole, of the apparent
underestimation of high yields can be attributed to random errors in the sample
estimates. The lines for the three years have also been brought into somewhat
closer agreement by the adjustments, for although the adjustments to the three 
regression coefficients are very similar, the lower mean yield of 1949 has raised 
this line relative to the others. 

In 1948 the estimates were provided by the Ministry's Crop Reporters, 
but in 1949 and 1950 the duty of making estimates was transferred to the 
National Agricultural Advisory Service. The close agreement of the adjusted
lines (the differences are no greater than would be expected from random
errors) demonstrates the very similar behaviour of the two different groups of
reporters. The line obtained by taking a weighted mean of these lines (by a
procedure we need not describe) is

$$ y = 7.19 + 0.563 (x - 8.78) \qquad (9.10.d) $$

These results may be used to establish a formula for correcting official
estimates in future years. This is not so simple a problem as it appears at
first sight. If both official and sample estimates were available for all counties
over a number of years, the regression of the sample yearly means $\bar{x}_t$ on the
official yearly means $\bar{y}_t$ would provide the appropriate equation of estimation,
at least in so far as future years could be regarded as a random sample from the
same population as the years in which the samples were taken. No adjustment
for errors in x would be required, since $\bar{x}_t$ is here the dependent variate. The
present results, however, only provide data for a selection of counties for three
years, and the regression of $\bar{x}_t$ on $\bar{y}_t$ will therefore be too ill-determined to be
of any value. Consequently we have to consider whether it is possible to
establish this regression line indirectly.

In the light of the results obtained we may tentatively assume : 

(1) The official county estimates are distributed about the mean adjusted
regression line of y on x given by equation 9.10.d with a residual variance
estimated at 0.624.*

(2) There may be an additional common component of error affecting all
the official estimates of a particular year, but in view of the closeness of the
adjusted regression lines of y on x for the three years this component is likely
to be small.

We shall also assume, in order to simplify the discussion, that the errors 
of a particular county about the regression line are independent from year to 
year. This is not likely to be wholly correct, since official estimates for a 
particular county are for the most part made by the same reporters in successive 
years. 

If $\bar{x}_t'$ is the true mean yield of all counties for the year t, and if the common
component of error under assumption (2) is negligible, the points ($\bar{x}_t'$, $\bar{y}_t$)
representing the yearly means will deviate from the regression line only by
an amount due to the random errors arising from assumption (1). The official
estimate for the country is a weighted mean of the county estimates, with
weights w proportional to the county potato acreages. Using the county acreages
for 1942 (which has a similar total potato acreage to 1950) we find

$$ S(w^2)/\{S(w)\}^2 = 0.0302 $$

and consequently from formula 7.5.e the variance of $\bar{y}_t$ about the regression
line is given by

$$ V_r(\bar{y}_t) = 0.0302 \times 0.624 = 0.0188 $$

We also require estimates of the mean and variance of the yearly means 
of either the true yields or the official estimates. No reliable estimate can be 
obtained from the present data, since only three years are available, but one 
can be obtained from the values of the official estimates over past years. Taking 
the last 20 years for which data are readily available (1930-1949) we obtain 

$$ \operatorname{mean}(\bar{y}_t) = 6.80 \qquad V(\bar{y}_t) = 0.333 $$

From the adjusted mean regression line (9.10.d) the corresponding mean of the $\bar{x}_t'$
for the true yields is 8.09.

If $\bar{y}_t'$ represents the value of y for the point on the regression line
corresponding to $\bar{x}_t'$ we have

$$ V(\bar{y}_t) = V(\bar{y}_t') + V_r(\bar{y}_t) = b'^2 V(\bar{x}_t') + V_r(\bar{y}_t) $$

and hence

$$ \mathrm{cov}(\bar{x}_t', \bar{y}_t) = b' V(\bar{x}_t') = \frac{1}{b'} \{V(\bar{y}_t) - V_r(\bar{y}_t)\} = \frac{1 - h}{b'} V(\bar{y}_t) $$

where $h = V_r(\bar{y}_t)/V(\bar{y}_t) = 0.0188/0.333 = 0.0565$.

The regression coefficient of $\bar{x}_t'$ on $\bar{y}_t$ (referred to the y axis) is therefore

$$ \frac{\mathrm{cov}(\bar{x}_t', \bar{y}_t)}{V(\bar{y}_t)} = \frac{1 - h}{b'} = \frac{1 - 0.0565}{0.563} = 1.676 $$

* This is calculated in the ordinary manner except that b is replaced by b' in the
formula for Q (Section 7.12).

The regression equation passing through the point given by the means of $\bar{x}_t'$ and $\bar{y}_t$ will be

$$ \bar{x}_t' = 8.09 + 1.676 (\bar{y}_t - 6.80) = 8.43 + 1.676 (\bar{y}_t - 7.00) \qquad (9.10.e) $$
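
The steps leading to equation 9.10.e can be followed numerically in the sketch below. All the figures are those quoted in the text; the variable and function names are illustrative. Because the text rounds at each stage, the constant at a yield of 7.00 comes out a little under the printed 8.43.

```python
b_adjusted = 0.563             # slope of the mean adjusted line (9.10.d)
x_mean, y_mean = 8.78, 7.19    # point through which line 9.10.d passes

residual_variance = 0.624      # county estimates about the adjusted line
weight_factor = 0.0302         # S(w^2) / {S(w)}^2 for the county acreages
var_r = weight_factor * residual_variance   # V_r of the yearly mean

mean_official = 6.80           # mean of official yearly means, 1930-1949
var_official = 0.333           # variance of official yearly means

h = var_r / var_official
slope = (1.0 - h) / b_adjusted  # regression coefficient of true yield on official

# Mean true yield corresponding to the mean official estimate, from 9.10.d.
mean_true = x_mean + (mean_official - y_mean) / b_adjusted


def predicted_true_yield(official_mean):
    """Estimated true mean yield for a given official yearly mean."""
    return mean_true + slope * (official_mean - mean_official)


print(round(var_r, 4), round(h, 4), round(slope, 3), round(mean_true, 2))
```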

This is not shown in Fig. 9.10 but is almost the same as the 1949 adjusted
regression.

If we had taken the regression of all the observed x on the observed y 
(disregarding the year classification) we should have obtained the equation 
(line F in the figure) 

$$ \bar{x}_t' = 8.88 + 0.966 (\bar{y}_t - 7.27) = 8.62 + 0.966 (\bar{y}_t - 7.00) \qquad (9.10.f) $$

This differs considerably in slope from the regression given above. If only
the data for the years 1948 and 1950 had been available the difference would
have been much greater, the line (line G in the figure) being

$$ \bar{x}_t' = 9.41 + 0.593 (\bar{y}_t - 7.65) = 9.02 + 0.593 (\bar{y}_t - 7.00) \qquad (9.10.g) $$

whereas the procedure giving equation 9.10.e would have given the line

$$ \bar{x}_t' = 7.64 + 1.953 (\bar{y}_t - 6.80) = 8.03 + 1.953 (\bar{y}_t - 7.00) \qquad (9.10.h) $$

This line is also not shown in the figure but is nearly the same as the 1950 
adjusted regression. The procedure therefore gives a relatively stable line. 

If there is an additional common component of error under assumption (2),
the slope of the regression equation (relative to the y-axis) should be decreased,
but since we have no means of estimating this component the amount of the
decrease cannot be assessed. We might, however, adopt the simple compromise
of taking the regression of $\bar{x}_t$ on $\bar{y}_t$ passing through the origin. For this
regression

$$ b = S(\bar{x}_t \bar{y}_t)/S(\bar{y}_t^2) = 1.222 $$

Hence the line (line I in the figure) is

$$ \bar{x}_t' = 1.222 \bar{y}_t = 8.55 + 1.222 (\bar{y}_t - 7.00) \qquad (9.10.i) $$
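
The coefficient 1.222 can be recovered from the three yearly means of Table 9.10. The sketch assumes that the regression through the origin was computed from those three pairs of means as the sum of products divided by the sum of squares of the official means; the working is not shown in the text, so this is offered as a check rather than as the original calculation.

```python
# Yearly means for the surveyed counties (Table 9.10), 1948 to 1950.
sample_means = [9.35, 7.52, 9.48]     # sample yields
official_means = [7.69, 6.30, 7.59]   # official estimates

# Regression of the sample means on the official means through the origin.
sum_products = sum(x * y for x, y in zip(sample_means, official_means))
sum_squares = sum(y * y for y in official_means)
b_origin = sum_products / sum_squares

# Value of the line at an official estimate of 7.00 tons per acre.
value_at_7 = b_origin * 7.00
print(round(b_origin, 3), round(value_at_7, 2))
```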
Apart from any additional common component of error, the success of equation
9.10.e in future years will depend on how far the reporters continue to make the
same type of error as they have in the past. If they become aware of their present
tendency to underestimate high yields they may endeavour to improve their
estimates. This will, of course, vitiate any adjustments based on equation
9.10.e.

The reader will find it instructive to calculate the predicted values of $\bar{x}_t'$
for the three years for which data are available.

9.11 The interpretation of multiple regression 

The interpretation of the results of multiple regression analysis requires 
the greatest care. Nothing is easier than to reach false conclusions. The first 
point to remember is that all regression and correlation analysis merely deals 
with associations. By itself it tells us little of the causative factors that are 
operating. Fortunately we are frequently in a position to make at least tentative 
assumptions about the actual causative system. When this is possible a
regression analysis can, under favourable circumstances, confirm or disprove our
assumptions, and provide estimates in quantitative terms of the effects of the 
different factors. 

As a specific example of the types of problem involved we may take the case 
of a survey of housing conditions conducted with the object of finding the 
influence of such conditions on the health of the occupants. It was observed 
by M'Gonigle and Kirby (1936) that the health of the inhabitants of " an 
unhealthy area " of Stockton-on-Tees deteriorated when they were rehoused 
in a self-contained municipal housing estate, owing to the fact that families 
moving to better houses had to spend a greater part of their total income on 
rent, etc., to the detriment of their general living standards, and particularly 
of their nutrition. On the other hand, in an investigation in Newcastle-upon-Tyne
during the depression, which revealed the alarming difference in health
between children of working-class (largely unemployed) parents and those of 
middle-class parents, Dr. J. C. Spence (reported by M'Gonigle and Kirby) 
came to the conclusion that the main factors responsible for the difference 
were 

(a) The housing conditions, which permit mass-infections of young
children at susceptible ages.
(b) Improper and inadequate diet, which prevents satisfactory recovery
from their illnesses.

He further states : " It is probable that these two factors are of equal importance ; 
but I would suggest that opinion on this matter should be reserved until a 
full inquiry, carried out by competent observers in a scientific manner, has 
studied the problem more closely." 

We will consider how this situation is likely to be reflected in the results 
of a survey of a group of the population subject to different housing conditions 
but otherwise relatively homogeneous. For purposes of discussion we will 
assume that information is available on income as well as on housing conditions
and health, but that no information is available on standard of nutrition, etc. 
It is reasonable to suppose that income affects housing conditions in that those 
with larger incomes will be in a position to obtain better housing, but that 
housing conditions do not exert any appreciable influence on income. Both 
income and housing conditions may be expected to affect health, income 
operating through housing conditions which are observed, and through other 
factors, such as nutrition, which are not observed. If U represents total income, 
V housing conditions and Y health this causative system can be represented 
by the following diagram :

[Diagram: arrows run from U to V, from U to Y, and from V to Y.]

The arrow between U and Y here represents the " net " effect of income on 
health, i.e. the effect of income other than that due to change in housing 
conditions. 

This leads to the concept of net income, i.e. income after deduction of rent 
and other charges associated with a given type of housing. This is the part 
of the income which is available to produce the net effect of income on health. 

To simplify the discussion consider only families of a given size and
composition. Take u to represent the total income of such a family, $u_n$ the net
income, v the index of housing conditions, and y the health index. If, within
the ranges covered by the variates, the causative relations are linear with a
superimposed random component, the equations representing the above causative
system may be written

$$ v - v_0 = \gamma (u - u_0) + e_1 \qquad (9.11.a) $$
$$ y - y_0 = \alpha (u_n - u_{n0}) + \beta (v - v_0) + e_2 \qquad (9.11.b) $$

where $e_1$ and $e_2$ are random components, the Greek letters are numerical
coefficients, and $u_0$, $u_{n0}$, $v_0$, $y_0$ represent a set of values of u, $u_n$, v, y near their
means which conform to the linear relationship defining the causative system.

The coefficient α represents the average increase in health index that may
be expected if incomes are raised one unit but people are prevented from
spending any of this additional income on improvement of housing conditions.
Similarly the coefficient β represents the average increase in health index that
would result from an improvement of housing conditions if this improvement
entailed no additional charges either direct or indirect on the occupier. The
coefficient γ represents the average increase in the housing-condition index
that may be expected to result from unit increase in income.

If information has been collected for the individual families on all the 
necessary points, the net income of each family can be calculated directly. 
In this case this variate should be used. Here, however, we wish to consider 
the case in which detailed data of this type are not available, but sufficient
information is available to make an estimate of the average charge $k$ on the
total income resulting from a unit increase in the housing-condition index.
We then have

    $u_n - u_{n0} = (u - u_0) - k(v - v_0) + \varepsilon_3$    (9.11.c)

The equation for $y$ may now be written

    $y - y_0 = \alpha (u - u_0) + (\beta - k\alpha)(v - v_0) + \varepsilon_2 + \alpha\,\varepsilon_3$    (9.11.d)

The basic data will be in the form of observations on the three variates, 
u, v and y. For purposes of analysis such data may well be condensed by
grouping over income classes and over housing conditions so as to form a
two-way table with income and housing condition as the two classifications,
the entries in the table being the mean health indexes for all the families
belonging to that cell. An auxiliary table giving the number of families in 
each cell will also be required. 

If, now, we calculate the multiple regression of $y$ on $u$ and $v$ we shall have
an estimate of the constants of equation 9.11.d. If this regression is

    $y = \bar y + a(u - \bar u) + b(v - \bar v)$    (9.11.e)

$a$ provides an estimate of $\alpha$ and $b$ of $\beta - k\alpha$. Consequently the direct effect of
housing $\beta$ is estimated by $b + ka$.

If also, the regression of $v$ on $u$ is calculated, and found to be

    $v = \bar v + c(u - \bar u)$    (9.11.f)

$c$ provides an estimate of $\gamma$.

The total regression of $y$ on $u$,

    $y = \bar y + a'(u - \bar u)$    (9.11.g)

is also of interest, as will be explained later. 

When this causative system is operating, therefore, the partial regression 
coefficient b of health on housing conditions, with total income as the second 
independent variate, does not estimate the direct effect of change of housing 
conditions on health. It represents the net effect, which is the difference between
this direct effect and the effect of the reduction of other aspects of the standard
of living due to having to spend more on housing. If there is compulsory
improvement of housing, by slum clearance schemes and the like, of amount
$\delta v$, without improvement of income either direct or indirect (e.g. by rent
subsidies), an improvement of health of $b\,\delta v$ may be expected. On the other
hand, if the full additional cost of the housing to the occupiers is covered
by subsidies or other means, an improvement of $(b + ka)\,\delta v$ may be expected.

If incomes are raised by an amount $\delta u$ and the situation is such that housing
conditions and rents cannot change, an improvement in health of $a\,\delta u$ may be
expected. If, however, the increased incomes are allowed to produce their
natural effect in improving housing conditions, this improvement may from
equation 9.11.f be expected to amount to $c\,\delta u$. The expected improvement
in health, using equation 9.11.d, will then be

    $a(\delta u - kc\,\delta u) + (b + ka)\,c\,\delta u = (a + bc)\,\delta u$

    $= \frac{a\,S(u - \bar u)^2 + b\,S(u - \bar u)(v - \bar v)}{S(u - \bar u)^2}\,\delta u$

    $= \frac{S(u - \bar u)(y - \bar y)}{S(u - \bar u)^2}\,\delta u = a'\,\delta u$,




the last line being derived from equation 9.9.b. The required increase is 
therefore given by the total regression of y on u, as might be expected. 
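
The relations between the fitted coefficients can be checked on artificial data. The following is a minimal sketch only: the numerical values of alpha, beta, gamma and k, the means and the error standard deviations are all hypothetical, chosen purely for illustration, and the function name is not from the text. It generates families according to equations 9.11.a to 9.11.c, fits the regressions 9.11.e to 9.11.g from the sums of squares and products, and returns a, b, c and a'.

```python
import random

def fit_causal_system(alpha=0.5, beta=2.0, gamma=0.3, k=0.4, n=20000, seed=1):
    """Simulate the causative system and fit regressions 9.11.e, 9.11.f, 9.11.g.

    All parameter values here are hypothetical illustrations.
    """
    rng = random.Random(seed)
    us, vs, ys = [], [], []
    for _ in range(n):
        u = rng.gauss(100.0, 15.0)                                  # total income
        v = 10.0 + gamma * (u - 100.0) + rng.gauss(0.0, 2.0)        # 9.11.a
        u_net = u - k * v + rng.gauss(0.0, 1.0)                     # 9.11.c
        y = 50.0 + alpha * u_net + beta * v + rng.gauss(0.0, 3.0)   # 9.11.b
        us.append(u)
        vs.append(v)
        ys.append(y)

    mu, mv, my = sum(us) / n, sum(vs) / n, sum(ys) / n
    s_uu = sum((u - mu) ** 2 for u in us)
    s_vv = sum((v - mv) ** 2 for v in vs)
    s_uv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    s_uy = sum((u - mu) * (y - my) for u, y in zip(us, ys))
    s_vy = sum((v - mv) * (y - my) for v, y in zip(vs, ys))

    # Normal equations of the multiple regression of y on u and v:
    #   a*s_uu + b*s_uv = s_uy,   a*s_uv + b*s_vv = s_vy
    det = s_uu * s_vv - s_uv ** 2
    a = (s_uy * s_vv - s_vy * s_uv) / det
    b = (s_vy * s_uu - s_uy * s_uv) / det
    c = s_uv / s_uu          # regression of v on u (9.11.f)
    a_total = s_uy / s_uu    # total regression of y on u (9.11.g)
    return {"a": a, "b": b, "c": c, "a_total": a_total}

if __name__ == "__main__":
    fit = fit_causal_system()
    print(fit, "a + b*c =", fit["a"] + fit["b"] * fit["c"])
```

In such a run a should lie close to alpha, b close to beta - k*alpha (not beta), and c close to gamma, while a' equals a + bc apart from rounding, since that identity follows from the normal equations themselves.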

The above interpretation can be accepted without qualification only if the 
causative system really conforms to the postulated model. In practice there are 
likely to be departures from such a simple model, many of which will introduce 
serious disturbances which may entirely vitiate the conclusions. 

In the first place there will usually be external causative agents, not included 
in our regression system, which affect the various variates. In the above example, 
for instance, the level of the education of the adult members of the household 
may be expected to affect income. It may also, to a less extent, affect housing 
conditions (apart from influence due to income). Provided education does not 
affect health directly, these influences will not disturb the partial regression 
coefficients of y on u and v or their interpretation, but they will affect the 
regression coefficient of v on u. This latter will now represent the sum of the 
effects of a direct increase in income and of the associated average increase in 
educational level. The coefficient will therefore no longer give an estimate of 
the increase in housing conditions that may be expected from a rise in income 
level of an individual whose educational level remains unchanged. 

If, on the other hand, educational level affects health directly, as well may 
be the case, the partial regression coefficients of health on income and housing 
conditions will be similarly disturbed. 

These disturbances can theoretically be eliminated and the effects of
education measured if an appropriate measure of education is available. All that
is necessary is to include a term for education in the regression system. In 
practice, however, it is not possible to eliminate all disturbances of this kind, 
because of the number of variates that may be involved, because some of them 
may not be measured or may be unmeasurable, and because correlations between 
them prevent the separation of their effects. 

A further complication which affects interpretation of regression equations 
arises when there is a two-way causal relationship. In the example considered, 
ill-health, if continued over any long period of time, may undoubtedly be 
expected to depress income. Consequently the association between income 
and health arises not only from the influence of income on health but also from 
the influence of health on income. The only way of attempting to assess the 
magnitude of disturbances of this kind and to eliminate their effects is by more
detailed observations of different parts of the material. If, for example, the 
health index is calculated from the health of the occupants of the house other 
than the wage-earner it may be hoped that the components of this index which 
affect income will be much reduced. 

Errors in the independent variates can also affect the results. In so far 
as these are random and their variances and covariances are known they can 
be allowed for by the method of Section 9.10. But this will not be the case 
in the problem we are considering. There will in fact be many aspects of 
housing which may affect health, often in different ways. These may be 
inadequately recorded (and indeed exact assessment of some aspects may be 
almost impossible) or the index may be imperfectly chosen.* Our regression 
will then not measure the full effect of housing conditions, but only the effect 
of such conditions as are correctly summarised in the index. 

This complicates the issue in another way. If income is accurately measured 
but housing conditions are inaccurately measured and the causative system 
shown above is operating, some of the effect that should be attributed to housing 
conditions will appear as a direct income effect. In the extreme case, for 
example, where the chosen housing index bears no relation to the housing 
conditions which affect health, and is uncorrelated with income, the partial 
regression coefficient of y on v will be zero, and that of y on u will be equal 
(except for random errors) to the total regression coefficient of y on u. 

When the chosen index measures some aspect of housing which is closely 
correlated with income but which does not affect health, the total effect of 
income will be divided between the partial coefficients of y on u and v in a 
manner which depends to a large extent on the chance distribution of error. 

From the above discussion it can be seen that the use of the regression 
method in the interpretation of survey data is fraught with hazards. In part 
these hazards arise from the fundamental weaknesses of observational material 
stressed at the beginning of Section 5.23, but in part they can be attributed
to an over-simplified approach to the problem. Housing conditions can vary
in many ways, and ill-health can take many forms. The more precise and
detailed the observations, the more relevant the quantities observed to the 
causal systems believed to be operating, and the greater our knowledge of the 
causal systems themselves, the more confidence can we have in our conclusions. 
Thus, any detailed analysis of the effect of general housing conditions on 
health generally must be extremely tentative, and may well, in the light of the 
above discussion, be judged to be not worth while. On the other hand, if 
we are dealing with a specific disease, such as dysentery, known to be spread 
under insanitary conditions, and if we can get a direct measure of these
insanitary conditions and can show that the incidence of dysentery is closely
related to them, then we shall feel ourselves on much safer ground in drawing
the conclusion that if these conditions are remedied we may expect to see a
considerable fall in the incidence of dysentery.

* Given ample data, statistical procedures are available for choosing the best index.
In fact, all that is necessary is to include the different components of the index as separate
terms in the regression equation, but the additional computational work involved in
such a procedure will not ordinarily be justified.

If, at the same time, our investigation is extended to other diseases, and 
these are shown to be related to the types of condition that are known to favour 
them, our confidence in the validity of all our conclusions will thereby be 
strengthened. Measuring a single association and drawing an isolated
conclusion from it does not, in fact, constitute good investigational work. Such
work is much more of the nature of a detective inquiry, in which all the separate 
pieces of evidence are assembled and fitted together. If and only if they form 
a coherent picture are we entitled to have confidence in our conclusions. 
Statistically, therefore, such work is much more difficult, and requires much 
more critical ability, than the analysis of experimental data, where the effects 
of separate factors are deliberately isolated in the planning of the experiment. 

In the above discussion we have considered the simple case of a regression 
analysis with two independent variates. If the data are extensive the regression 
of health on housing conditions can be calculated separately for each income 
level. The partial regression coefficient b will be a weighted average of these 
regression coefficients. Similarly the partial regression coefficient a is a 
weighted average of the coefficients of the regression of health on income for 
different levels of housing conditions. The multiple regression method, 
therefore, provides an automatic averaging of the separate regression lines for 
different parts of the data. If examination of the data indicates that there are 
real differences of a meaningful nature in the separate regression lines, such 
averaging will be inappropriate. 

We have thought it worth while to consider a specific case of regression 
analysis in some detail because of the very real dangers of misinterpretation 
in investigational work. We have taken the case where the underlying causal 
relationship may be considered to be of bivariate linear form. The same 
considerations and qualifications apply when some or all of the variates are 
qualitative. Indeed the analysis of such data by the method of fitting constants 
(exemplified in Section 5.24) is very similar in principle to multiple regression
analysis. If the data, whether quantitative or qualitative, are extensive the
necessary examination can frequently be made by comparisons of the means
of various groups instead of formally calculating regressions or fitting constants
(see Sections 9.6 and 9.8). Whatever the method employed, however, the
interpretation of the results is governed by the same principles. 
CHAPTER 10 
MISCELLANEOUS DEVELOPMENTS 

10.1 Machine developments 

There is not space here to give an account of all the most recent developments 
in tabulating machinery. Improvements and modifications are continually 
being introduced by the various machine companies. Moreover the development 
of the machines of the different companies of the Hollerith group is following 
somewhat different lines. The description given in Sections 5.11-5.16 refers 
to British machines and is not applicable in detail to the machines of other 
companies. The companies should therefore be consulted when any large
tabulating job is to be undertaken, particularly if the installation of new
equipment is contemplated.

There are, however, two developments in punched card machinery, which 
have proved of particular value in census work, and which may therefore be 
mentioned here. The first, mark sensing, was briefly referred to in Section 5.9.
The view there expressed that it was not likely to be of much value in census 
work has proved to be incorrect. It is, for example, successfully used in the 
Canadian Labour Force Census which is carried out on a sample at intervals 
of three months (Keyfitz and Robinson, 1949, F'). In ordinary mark sensing 
specially printed Hollerith cards are used. For each column to be punched a 
mark is made in one or more of twelve positions, and the cards are subsequently 
passed through a machine which reads these marks and punches the information 
on the same card. In the Canadian Labour Force Census special record cards 
are used, both sides being marked by the field officer. The only coding done in 
the office, which is also recorded by mark sensing on the original card, is the 
classification of the worker by industry and occupation. The completed cards 
are then passed through a special machine which reads the recorded information 
and punches it on an ordinary Hollerith card. Checks for inconsistencies are 
made at the time of punching by mechanism of the type described in the next 
paragraph incorporated in the machine. It is found that the use of mark sensing, 
amongst other advantages, results in a clear saving of two weeks in the processing 
of the data. 

The other development which is of interest is the introduction of techniques 
and equipment for more effective checking for inconsistencies (mechanical
editing) referred to in Section 5.20. Machines are now available which will
check for inconsistencies (including inconsistent relationships between different 
fields) on any of the 80 columns of the card. Sometimes, depending on the 
complexities of the inter-relationships being checked, or as many as 
twenty-five columns can be simultaneously examined. Cards pass through 
these machines at a rate of 450 per minute and those which fail the consistency 
checks are automatically separated from the others. Machines for tabulations 
of the census type are also continually being improved and elaborated. 
It may also be mentioned here that the collator (Section 5.16) can be fitted
with a card counting device. Amongst other uses this enables every qth card
of a pack to be picked out (q being an integer not greater than 99). This can
be of value in extensive systematic sampling from a pack of cards. 

In addition to developments in punched card machinery (which themselves 
include electronic devices) an entirely new field is being opened up by the 
development of high-speed electronic computers. Most of the early machines 
were built mainly for the purpose of undertaking the type of calculation 
required in the mathematical and physical sciences, but since the Second Edition 
of this book there has been very rapid development of machines suitable for 
large-scale data processing, and many of the machines now operating are 
eminently suitable for the types of computation required in the analysis of 
censuses and surveys. In view of their importance in this field a brief description 
of their main features and mode of operation is given in Chapter 11. 



10.2 Methods of taking a stratified sample from a list or card index 

Although there is no theoretical difficulty in taking a stratified sample 
from a list or card index, the practical difficulties are often considerable if the 
register is at all large. In this section various alternative procedures are 
discussed. 

As an example we will consider alternative methods of taking a sample of 
vehicles from a register of 100,000 vehicles which can be stratified into 6 strata 
with numbers approximately as in Table 10.2. 

TABLE 10.2 NUMBER OF VEHICLES IN A REGISTER

    Type of operation     0-10 cwt.       11-20 cwt.      Over 20 cwt.
    For hire              20,000 (50)     20,000 (200)    8,000 (160)
    On own account        40,000 (100)    10,000 (100)    2,000 (40)
    Sampling fraction     1/400           1/100           1/50


It is desired to take a stratified sample with a variable sampling fraction 
having the values shown. The numbers in brackets give the numbers in the 
sample (total, 650). 

(1) Cards arranged (or easily arrangeable) in the required strata 

In this case, every qth card of a stratum will be taken, 1/q being the required
sampling fraction. It is best to start with a random number between 1 and q. 
Thus for the stratum of over 20 cwt. on own account we should take every 
50th card out of the 2000, starting with a random number between 1 and 50. 
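
The selection of every qth card with a random start can be sketched as follows. This is an illustrative outline only; the function name and the representation of the stratum as a Python list are assumptions made for the example, not part of the procedure as described.

```python
import random

def systematic_sample(cards, q, rng=None):
    """Take every q-th card of a stratum, starting at a random card between 1 and q."""
    rng = rng or random.Random()
    start = rng.randint(1, q)      # random start, 1..q inclusive
    return cards[start - 1::q]     # cards numbered start, start+q, start+2q, ...

if __name__ == "__main__":
    # Stratum "over 20 cwt., on own account": 2000 cards, sampling fraction 1/50.
    stratum = list(range(1, 2001))
    sample = systematic_sample(stratum, 50, random.Random(3))
    print(len(sample), sample[:3])
```

Because 2000 is an exact multiple of 50, the sample contains 40 cards whatever the random start; when the stratum size is not a multiple of q the number selected can vary by one.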
(2) Numbers in strata known, but cards not easily arranged in strata

A preliminary sample of every $q$th card can be taken, where $1/q$ is somewhat
larger than the largest sampling fraction $f_0$. A reasonable rule for determining
$q$ is as follows :

If the number required in the stratum with the largest sampling fraction
(or the smallest of these strata if there is more than one) is $n_0$, and the total
number in the stratum is $N_0$, $q$ should be chosen as the next convenient
whole number less than (or equal to) $q'$, where



There is then a chance of about 9/10 of getting the required number or 
more in this stratum in the preliminary sample. 

In our example $N_0 = 2000$, $n_0 = 40$. Hence $q' = 40$ and consequently
$q = 40$.
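
The chance of obtaining at least the required number in the preliminary sample can be checked roughly by a binomial calculation. The sketch below assumes, purely for illustration, that the cards of the stratum are scattered at random through the register, so that the number falling in a preliminary sample of every qth card is approximately binomial with index N_0 and probability 1/q. The function name is hypothetical.

```python
import math

def prob_at_least(n0, N0, q):
    """P(at least n0 of the N0 stratum cards fall in a 1-in-q preliminary sample).

    Assumes a binomial(N0, 1/q) count, i.e. stratum cards randomly placed
    in the register; q must exceed 1.
    """
    p = 1.0 / q

    def log_pmf(x):
        # log of the binomial probability, via log-gamma to avoid overflow
        return (math.lgamma(N0 + 1) - math.lgamma(x + 1) - math.lgamma(N0 - x + 1)
                + x * math.log(p) + (N0 - x) * math.log1p(-p))

    return 1.0 - sum(math.exp(log_pmf(x)) for x in range(n0))

if __name__ == "__main__":
    print(prob_at_least(40, 2000, 40))   # preliminary sample of every 40th card
    print(prob_at_least(40, 2000, 50))   # no margin: every 50th card
```

With q = 40 the chance is of the order of 9/10, whereas with q = 50 (no margin over the final sampling fraction) it is only about one half, which is why 1/q is taken somewhat larger than the largest sampling fraction.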

The cards of the preliminary sample must be withdrawn, sorted by strata 
and counted. For each stratum the requisite fraction is then taken or rejected 
(at random or by selecting every rth card) so as to obtain the number required 
in that stratum in the final sample. 

If the register is in the form of a list, the numbers in the different strata 
in the preliminary sample must be marked and counted, or copied in a new list, 
and the requisite fraction then selected as above or by the method of simultaneous 
selection given in Section 3 (a) below. 

The above procedure is not satisfactory if any of the sampling fractions are 
large, particularly if the corresponding strata are a small fraction of the whole. 
All the members of such strata should be picked out by examination of all 
cards. The remainder can then be dealt with as above. 

(3) Numbers in strata not known, and sorting troublesome or impractical 

(a) Simultaneous counting out of strata. 

Since the sampling fractions are known, we can keep running counts of all 
strata simultaneously, going through the register card by card. Every qth card 
of a particular stratum is withdrawn as it is reached. This method requires 
considerable care to avoid misclassification in the counting. It is best for two 
people to co-operate, one calling out the classification, and the other keeping 
the counts. 
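
The simultaneous counting-out of strata can be sketched as a single pass through the register with one running count per stratum. The data layout (cards as tuples with a stratum label), the function name and the small artificial register are assumptions made for this illustration only.

```python
def simultaneous_selection(register, stratum_of, q_by_stratum, start_by_stratum):
    """One pass through the register, keeping a running count for every stratum.

    A card is withdrawn when its stratum count equals start, start+q, start+2q, ...
    where q is the stratum's sampling interval and 1 <= start <= q.
    """
    counts = {s: 0 for s in q_by_stratum}
    selected = []
    for card in register:
        s = stratum_of(card)
        counts[s] += 1
        offset = counts[s] - start_by_stratum[s]
        if offset >= 0 and offset % q_by_stratum[s] == 0:
            selected.append(card)
    return selected, counts

if __name__ == "__main__":
    # Artificial register: every 5th card is "large", the rest "small".
    register = [(i, "small" if i % 5 else "large") for i in range(1, 1001)]
    picked, counts = simultaneous_selection(
        register,
        stratum_of=lambda card: card[1],
        q_by_stratum={"small": 400, "large": 50},
        start_by_stratum={"small": 1, "large": 50},
    )
    print(counts, picked)
```

The final counts also give the numbers in the strata, so the sampling fractions actually achieved can be verified at the end of the pass.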

(b) Complete count as (a) but without selection of the sample, which can then 
be selected as in (2). 

(c) Count of strata from every rth card only, every q/rth card of a stratum 
being selected as in (a). 

Thus in the example, if every 10th card is examined, every 40th of those 
falling in the two smallest size-group strata will be selected, etc. This is a 
case of two-phase sampling. 
(d) As (c) but picking out every rth card, sorting into strata, and sampling
as in (1). 

In the case of methods 3(c) and 3(d) certain additional errors due to the 
two-phase sampling will be introduced. We therefore require to determine 
the fraction of cards which should be taken at the first phase. 

Suppose that in our illustrative example we have the following effective 
variances per vehicle (i.e. the quantities such that the error variance of the mean 
of a sample of n is the effective variance/n, the sampling fractions being taken
to be small) :

    Type of sample                                   Effective variance
    Random                                                  100
    Stratified with variable sampling fraction               10

We shall then have the following error variances with samples of 650 vehicles :

    Type of sample                                   Error variance
    Random                                           100/650    = 0.15
    Stratified with variable sampling fraction        10/650    = 0.015
    Two-phase, every 10th card
        1st phase                                    100/10,000 = 0.010
        2nd phase                                     10/650    = 0.015
        Total                                                     0.025
    Two-phase, every 5th card
        1st phase                                    100/20,000 = 0.005
        2nd phase                                     10/650    = 0.015
        Total                                                     0.020

The fraction included at the first phase should be related to the ratio of the 
amounts of work required to select the sample and to collect the information 
after selection. With every 5th card, information will have to be collected
from one-third more of the vehicles (0.020/0.015 - 1) than if a complete count
of strata were made. Proper figures for the variances must, of course, be 
obtained from actual data before reaching any conclusions. 
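
The arithmetic of the two-phase comparison can be set out as a short calculation. This is a sketch using the illustrative effective variances quoted above (100 and 10), not figures from real data; the function name is hypothetical.

```python
def two_phase_variance(var_random, var_stratified, register_size, r, n):
    """Error variance when strata are counted from every r-th card only.

    first phase : var_random / (register_size / r)
    second phase: var_stratified / n
    """
    first = var_random / (register_size / r)
    second = var_stratified / n
    return first, second, first + second

if __name__ == "__main__":
    for r in (10, 5):
        first, second, total = two_phase_variance(100.0, 10.0, 100_000, r, 650)
        extra = total / second - 1.0   # extra sample needed to offset phase one
        print(r, round(first, 4), round(second, 4), round(total, 4), round(extra, 3))
```

The quantity total/second - 1 is the proportionate increase in the final sample needed to recover the accuracy of a complete count of strata; with every 5th card it comes to about one-third, as stated.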

The two-phase method can profitably be used when considerable extra 
work is entailed in obtaining particulars for stratification, such as looking up 
vehicles in a supplementary register.
10.3 A method of compensating for incomplete ascertainment in a
stratified sample with uniform sampling fraction

If information is lacking on certain units of a stratified sample (due for
example to non-response), the raising factors of the different strata can be
adjusted so as to preserve the correct weighting between the different strata.
As has been pointed out in Section 5.22 this will only properly compensate
for the missing units in so far as these are similar, in the variates concerned,
to the remaining units of the stratum in which they fall. If this is the case,
and if the differences between strata are large and the proportion of missing
units varies considerably from stratum to stratum, an adjustment of this kind
may substantially improve the accuracy of the estimates.

With such an adjustment of the raising factors, however, the simplicity of 
analysis that exists with a sample with uniform sampling fraction will be lost. 
This is a matter of some importance in large-scale surveys with many variates, 
particularly when punched-card machinery is used. 

A simple method of avoiding this difficulty is to reject units at random from 
the strata with the smaller proportions of missing units, the numbers rejected 
being so chosen that after rejection all strata have the same proportion of missing 
units. This, however, will result in loss of information which will be
considerable if a few of the strata have relatively large proportions missing.

An alternative, which does not entail much extra work when punched card 
machinery is used, is to reject units (by the selection of cards at random) from 
the strata with the smaller proportions of missing values, and to duplicate 
cards of units selected at random from the strata with the larger proportions 
of missing values, the numbers rejected and duplicated being so chosen that 
the strata are finally represented in the correct proportions. 

As a simple example, we may consider a stratified sample with four equal
strata, each of 1,000 units. Information is available on 80, 85, 90 and 95 per
cent. of these units in the four strata. If we reject and duplicate so as to bring
all percentages to 90, i.e. 900 units per stratum, we shall require to duplicate
100 cards from the first stratum and 50 from the second, and to reject 50 cards
from the fourth stratum. The mean of the first stratum will then be given by

    $\bar y_1 = (S'y + 2\,S''y)/900$

where $S'$ and $S''$ indicate summation over the non-duplicated and duplicated
units respectively. Consequently, if the population is large and the variance
per unit in the first stratum is $\sigma_1^2$, the sampling variance of the mean of the first
stratum will be

    $V(\bar y_1) = \{700\,\sigma_1^2 + 4 \times 100\,\sigma_1^2\}/900^2 = 0.00136\,\sigma_1^2$

compared with the variance $\sigma_1^2/800 = 0.00125\,\sigma_1^2$ of the unweighted stratum
mean. The variance of the mean of the second stratum can be calculated
similarly. If the variance is the same within each stratum this equals
$0.00123\,\sigma_1^2$. The variances of the means of the last two strata will each be
$\sigma_1^2/900 = 0.00111\,\sigma_1^2$.

The estimate of the mean of the population will be given by 1/4 of the sum
of the means of the strata, and its variance will consequently be 1/16 of the
sum of the variances. This is found to be $0.000301\,\sigma_1^2$. The properly weighted
estimate (i.e. the mean of the strata means without rejection or duplication)
has a variance of

    $\frac{\sigma_1^2}{16}\left(\frac{1}{800} + \frac{1}{850} + \frac{1}{900} + \frac{1}{950}\right) = 0.000287\,\sigma_1^2$

The estimate obtained by rejecting units from all strata except the first, so that
800 units are retained in each stratum, has a variance of $\sigma_1^2/3200 = 0.000312\,\sigma_1^2$.

Consequently 4.6 per cent. of the information is lost by combined rejection
and duplication, and 8.0 per cent. by rejection only. These losses are additional
to the losses due to missing information.

The loss in actual cases that have to be dealt with can be calculated in the 
above manner. If there is doubt about the most appropriate level of rejection 
the losses with different levels of rejection can be determined and compared. 
It will be found that in general there is least loss of information if rather more 
units are duplicated than are rejected. 
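
The calculation of the loss of information for the four-stratum example can be sketched as below. The sketch assumes, as in the example, equal strata with a common variance per unit (variances are expressed in units of that variance) and a large population; the function names are hypothetical.

```python
def stratum_variance_factor(available, target):
    """Variance of a stratum mean, in units of sigma^2, after bringing the
    number of cards to `target` by random rejection or random duplication."""
    if available >= target:
        return 1.0 / target                 # plain mean of `target` retained units
    dup = target - available                # cards duplicated (each counted twice)
    if dup > available:
        raise ValueError("cannot duplicate more cards than are available")
    # (available - dup) units carry weight 1, dup units carry weight 2
    return ((available - dup) + 4 * dup) / target ** 2

def adjusted_variance(availables, target):
    """Variance of the mean of the stratum means after rejection/duplication."""
    k = len(availables)
    return sum(stratum_variance_factor(m, target) for m in availables) / k ** 2

def weighted_variance(availables):
    """Variance of the properly weighted estimate (no rejection or duplication)."""
    k = len(availables)
    return sum(1.0 / m for m in availables) / k ** 2

if __name__ == "__main__":
    avail = [800, 850, 900, 950]
    v_best = weighted_variance(avail)
    for target in (900, 800):
        v = adjusted_variance(avail, target)
        print(target, round(v, 6), "loss of information:", round(1 - v_best / v, 3))
```

Calling adjusted_variance with other target levels shows how the loss varies with the level of rejection chosen. The losses computed from unrounded variances agree with the figures quoted above to within the rounding of the variances.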

10.4 Optimal allocation for more than one variate 

When an estimate of the population total or mean of a single variate is 
required, the optimal sampling fractions for a stratified random sample with 
variable sampling fraction are given by the equations of Section 8.17 (a).
If estimates for two or more variates with different within-strata variances are 
required, these equations will give different values of the sampling fractions. 
It is often asked how the sampling fractions should be chosen in such 
circumstances. 

Suppose there are two variates which are denoted by $y$ and $y'$ with within-strata
variances $\sigma_i^2$ and $\sigma_i'^2$. Three cases arise. Firstly, if sampling fractions
are chosen which are optimal for $Y$ it may be found that $Y'$ is estimated with
more than the required accuracy. Secondly, $Y$ may be estimated with more
than the required accuracy when the sampling fractions are optimal for $Y'$.
Thirdly, neither of these conditions may hold. In the first two cases no problem
arises, since we choose the sampling fractions which are optimal for $Y$ or $Y'$
respectively. In the third case, as mentioned in Section 3.5, sampling fractions
which are sufficiently nearly optimal can often be determined by compromise,
without any exact investigation, but a formal solution is possible on the lines
of Section 8.17.

Without imposing additional conditions we cannot minimise the variance
for a given cost, since two variances are involved. We can, however, minimise
the cost for given values of $V(Y)$ and $V(Y')$. If these values are denoted by
$A$ and $A'$ and the cost is $C$, we have

    $V(Y) = S\,\sigma_i^2 (1 - f_i) N_i^2/n_i = S\,N_i \sigma_i^2 (g_i - 1) = A$    (10.4.a)
    $V(Y') = S\,\sigma_i'^2 (1 - f_i) N_i^2/n_i = S\,N_i \sigma_i'^2 (g_i - 1) = A'$    (10.4.b)
    $C = S\,c_i n_i$
We require to minimise $C$ subject to conditions 10.4.a and 10.4.b. Following
the procedure of Section 8.17 we may add multiples $L$ and $L'$ of these conditions
and differentiate with respect to each $n_i$ in turn. Equating the differentials to
zero we obtain the equations

    $c_i - L\,\sigma_i^2/f_i^2 - L'\,\sigma_i'^2/f_i^2 = 0$

Hence

    $f_i = \sqrt{L\,\sigma_i^2 + L'\,\sigma_i'^2}\,/\sqrt{c_i}$    (10.4.c)

Extension to three or more variates merely requires the insertion of additional
terms $L''\sigma_i''^2$, etc., in the numerator.

Substitution for the $g_i$ ($= 1/f_i$) in equations 10.4.a and 10.4.b will give
two simultaneous equations for $L$ and $L'$. Unfortunately these equations
cannot be easily solved. A solution by successive approximation on the lines
of the example below is therefore necessary.

If we put $L = \lambda/K$ and $L' = \lambda'/K$ with $\lambda + \lambda' = 1$, so that $L + L' = 1/K$,
we have the alternative form

    $f_i = \sqrt{\lambda\,\sigma_i^2 + \lambda'\,\sigma_i'^2}\,/(\sqrt{K}\sqrt{c_i})$    (10.4.d)

The numerator is now a weighted mean of $\sigma_i^2$ and $\sigma_i'^2$, and the analogy with
equation 8.17.b will be apparent.

Putting $\sqrt{c_i}/\sqrt{\lambda\,\sigma_i^2 + \lambda'\,\sigma_i'^2} = g_{i0}$ we have $g_i = g_{i0}\sqrt{K}$, and therefore,
from equations 10.4.a and 10.4.b,

    $\sqrt{K}\;S\,N_i \sigma_i^2 g_{i0} = V(Y) + S\,N_i \sigma_i^2$    (10.4.e)

    $\sqrt{K}\;S\,N_i \sigma_i'^2 g_{i0} = V(Y') + S\,N_i \sigma_i'^2$    (10.4.f)

For any assumed value of $\lambda$ the value of $\sqrt{K}$ which gives $V(Y) = A$ can
therefore be determined from equation 10.4.e. The value of $V(Y')$ for this
value of $\sqrt{K}$ can then be found from equation 10.4.f. The value of $\lambda$ which
gives $V(Y') = A'$ under these circumstances can thus be determined by
successive approximation, taking different trial values of $\lambda$. This provides an
alternative method of solution. If desired the roles of $y$ and $y'$ can be
interchanged in this process. With more than two variates this alternative method
of solution is preferable, since one dimension is thereby eliminated.

Equations 10.4.e and 10.4.f are also of use when trial values of $L$ and $L'$
are taken directly. With such trial values neither $V(Y)$ nor $V(Y')$ will have
the required value. We can, however, adjust both $L$ and $L'$ by a factor $\theta$,
so chosen that neither $V(Y)$ nor $V(Y')$ exceeds its required value. The sampling
fractions will then all be multiplied by a factor $\sqrt{\theta}$. The value of $\sqrt{\theta}$ is given
by equation 10.4.e or 10.4.f (the larger value being taken) with $\sqrt{K}$ replaced
by $1/\sqrt{\theta}$ and $g_{i0}$ by the raising factors calculated from the trial values of $L$ and $L'$.

If any of the $f_i$ are found to be greater than unity the solution must be
revised, as indicated in Section 8.17 (a). 
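
The solution by successive approximation on the weight given to the first variate (the weights being those of equation 10.4.d) can be sketched as follows. This is an illustrative outline under stated assumptions, not the computation as carried out in the example: it uses simple interval-halving in place of hand trials, takes the strata figures from Table 10.4.a and the required variances 2322 and 4,990,000 from Example 10.4, assumes unit costs, and omits the revision needed if any sampling fraction exceeds unity. The function and variable names are hypothetical.

```python
import math

# Figures of Table 10.4.a (Example 10.4): numbers of farms and within-strata variances.
N = [440, 520, 360, 520, 400, 210, 50]
S2 = [0.0, 0.0384, 0.16, 0.24, 0.16, 0.09, 0.0]          # farms growing wheat (p*q)
S2P = [0.0, 2.0, 15.0, 160.0, 650.0, 1700.0, 4500.0]     # wheat acreage

def allocate_two_variates(N, s2, s2p, A, Ap, cost=None, iters=200):
    """Find the weight lam and sampling fractions f_i giving V(Y) = A, V(Y') = Ap."""
    cost = cost or [1.0] * len(N)
    tot = sum(n * v for n, v in zip(N, s2))
    totp = sum(n * v for n, v in zip(N, s2p))

    def trial(lam):
        g0 = []
        for v, vp, c in zip(s2, s2p, cost):
            w2 = lam * v + (1.0 - lam) * vp
            # strata with no variance in either variate need not be sampled
            g0.append(math.sqrt(c / w2) if w2 > 0 else None)
        # 10.4.e: root_k chosen so that V(Y) = A
        root_k = (A + tot) / sum(n * v * g for n, v, g in zip(N, s2, g0) if g)
        # 10.4.f: resulting value of V(Y')
        vyp = root_k * sum(n * vp * g for n, vp, g in zip(N, s2p, g0) if g) - totp
        f = [1.0 / (g * root_k) if g else 0.0 for g in g0]
        return vyp, f

    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        vyp, _ = trial(mid)
        if vyp > Ap:
            hi = mid      # V(Y') too large: give more weight to y'
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    vyp, f = trial(lam)
    return lam, f, vyp

if __name__ == "__main__":
    lam, f, vyp = allocate_two_variates(N, S2, S2P, 2322.0, 4_990_000.0)
    print("weight on first variate:", lam)
    print("sampling fractions:", [round(x, 4) for x in f])
    print("expected sample size:", round(sum(n * x for n, x in zip(N, f)), 1))
```

Because the two variates are measured on very different scales, the weight on the first variate comes out very close to unity; this reflects only the units of measurement, not the relative importance of the two estimates.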

The problem is also capable of re-formulation in terms of the minimisation 
of costs plus losses and in this form has a direct solution. If, following Section 
8.18, the expected losses are taken to be $a\,V(Y)$ and $a'\,V(Y')$ and we minimise
the costs plus losses, we shall obtain the equations 10.4.c with $L$ and $L'$
replaced by $a$ and $a'$. If $a$ and $a'$ are known, these equations can be solved
immediately.

If $a$ and $a'$ are not known but the ratio between $a$ and $a'$ can be assumed,
we have $\lambda = a/(a + a')$ and $\lambda' = a'/(a + a')$, and a value of $\sqrt{K}$ can be chosen
which gives the required general level of accuracy, using equations 10.4.e
and 10.4.f. In a survey to determine the amount of unemployment in various
industries stratified by districts, for example, it might be reasonable to attach
equal importance to errors in the unemployment totals proportional to the
square roots of the numbers employed in the industries, in which case $a$, $a'$,
etc., would be taken inversely proportional to these square roots. It should
be noted, however, that this does not imply that the actual standard errors
will be in the ratio of these square roots. The ratio obtained will be given by
equations 10.4.e and 10.4.f and may differ substantially from the ratio of
the square roots.

Example 10.4 

Using the data on Hertfordshire farms described in Section 3.7, etc., and
the size-groups of Example 6.7, determine the sampling fractions which will
be optimal for simultaneously estimating the number of farms growing wheat
and the total wheat acreage, each with a standard error of 5 per cent. The cost
per farm is to be taken to be the same for all strata, i.e. all $c_i = 1$.

We require the numbers of farms in each stratum, the proportion of farms
growing wheat in each stratum, and the within-strata variances of the acreages
of wheat of individual farms. Composite estimates from the various tables
given in the preceding examples have been made. (The actual values could,
of course, have been calculated from the original data, but these were not
readily available.) These are shown in Table 10.4.a, where $p_i$ denotes the
proportion of farms growing wheat and $s_i'^2$ represents the within-strata variances
of the wheat acreages. There is no need to describe how the estimates were
obtained.

TABLE 10.4.a HERTFORDSHIRE FARM DATA

Stratum  Size-group  $N_i$   $p_i$   $s_i^2$   $s_i'^2$   $N_i s_i^2$   $N_i s_i'^2$   $s_i$      $s_i'$
1        1-          440     0       0         0          0             0              0          0
2        6-          520     0.04    0.0384    2          20.0          1040           0.19596    1.414
3        21-         360     0.2     0.16      15         57.6          5400           0.4        3.873
4        51-         520     0.6     0.24      160        124.8         83200          0.48990    12.649
5        151-        400     0.8     0.16      650        64            260000         0.4        25.495
6        301-        210     0.9     0.09      1700       18.9          357000         0.3        41.231
7        601-        50      1.0     0         4500       0             225000         0          67.082
Total                2500                                 285.3         931640

From the table the estimate of the number of farms growing wheat, denoted
for convenience by $Y$, is

$Y = \Sigma N_i p_i = 963.8$.

A standard error of 5 per cent will therefore be equivalent to a variance of
$48.19^2$ or 2322. The total wheat acreage is 44,676 acres (Section 3.7) and the
corresponding variance is therefore 4,990,000. The values of $s_i^2$ for numbers
of farms are given by the equation

$s_i^2 = p_i q_i$

where $q_i = 1 - p_i$. These values are tabulated in Table 10.4.a.

As a first step the sampling fractions (or raising factors) which are optimal
for each variate separately should be calculated. Following the method of
Section 8.17 (a) we require to calculate $\Sigma N_i s_i \sqrt{c_i}$ and $\Sigma N_i s_i^2$ and the
similar functions for wheat acreage. $N_i s_i^2$ and $N_i s_i'^2$, which are required in
the subsequent calculations, are tabulated in Table 10.4.a, as are $s_i$ and $s_i'$.
We then find

$\Sigma N_i s_i \sqrt{c_i} = 723.65$      $\Sigma N_i s_i^2 = 285.3$

$\Sigma N_i s_i' \sqrt{c_i} = 30918$      $\Sigma N_i s_i'^2 = 931640$

Hence from equation 8.17.c we have, for the optimal sampling fractions for
number of farms,

$1/\sqrt{K} = 723.65/(2322 + 285.3) = 0.2775$
$L_0 = 1/K = 0.07701$

The values of $f_i$ may now be calculated from equation 8.17.b, using the values
for $s_i$ given in Table 10.4.a. These values are shown in Table 10.4.b.

TABLE 10.4.b OPTIMAL SAMPLING FRACTIONS FOR NUMBER OF FARMS,
FOR ACREAGE, AND FOR BOTH VARIATES SIMULTANEOUSLY

Stratum   No. of farms   Acreage   Both variates
2         0.0544         0.0074    0.0478
3         0.1110         0.0202    0.0981
4         0.1359         0.0660    0.1268
5         0.1110         0.1331    0.1313
6         0.0832         0.2153    0.1603
7         0              0.3502    0.2324

Similarly for the optimal sampling fractions for acreage

$1/\sqrt{K'} = 0.005221$
$L_0' = 0.00002726$

giving the values of $f_i$ for acreage of Table 10.4.b.
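
The single-variate calculation above can be checked numerically. The following Python sketch is illustrative only. Equations 8.17.b and 8.17.c are not reproduced in this section, so the forms used here, $1/\sqrt{K} = \Sigma N_i s_i \sqrt{c_i} / (V + \Sigma N_i s_i^2)$ and $f_i = s_i/(\sqrt{K} \sqrt{c_i})$, are assumptions inferred from the arithmetic of the example, and the names `optimal_fractions`, `S_FARMS` and `S_ACRES` are hypothetical.

```python
import math

# Data from Table 10.4.a (strata 1 to 7).
N = [440, 520, 360, 520, 400, 210, 50]
S_FARMS = [0.0, 0.19596, 0.4, 0.48990, 0.4, 0.3, 0.0]
S_ACRES = [0.0, 1.414, 3.873, 12.649, 25.495, 41.231, 67.082]


def optimal_fractions(sizes, sds, required_variance, costs=None):
    """Return (1/sqrt(K), [f_i]) for a single variate.

    Assumed forms (inferred from the worked figures, not quoted):
      1/sqrt(K) = sum(N s sqrt(c)) / (V + sum(N s^2))
      f_i       = s_i / (sqrt(K) sqrt(c_i))
    """
    if costs is None:
        costs = [1.0] * len(sizes)  # the example takes all c_i = 1
    num = sum(n * s * math.sqrt(c) for n, s, c in zip(sizes, sds, costs))
    den = required_variance + sum(n * s * s for n, s in zip(sizes, sds))
    inv_sqrt_k = num / den
    fractions = [inv_sqrt_k * s / math.sqrt(c) for s, c in zip(sds, costs)]
    return inv_sqrt_k, fractions


k_farms, f_farms = optimal_fractions(N, S_FARMS, 2322.0)
k_acres, f_acres = optimal_fractions(N, S_ACRES, 4990000.0)
print(round(k_farms, 4), [round(f, 4) for f in f_farms])
print(round(k_acres, 6), [round(f, 4) for f in f_acres])
```

Last-digit differences from Table 10.4.b can occur because the printed values were obtained from rounded intermediate figures.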

It is clear that neither of these two sets of sampling fractions gives the
requisite accuracy in the other variate. Something intermediate is therefore
required. As a first trial (a) we may take $L = 0.04$ and $L' = 0.000013$, i.e.
roughly half $L_0$ and $L_0'$. The intermediate calculations for $V(Y)$ and $V(Y')$
are given in Table 10.4.c. The values for $g_i$ can be calculated on the slide
rule, using equation 10.4.c.

TABLE 10.4.c VALUES OF $f_i$ AND $g_i$ FOR $L = 0.04$, $L' = 0.000013$

Stratum   $L s_i^2 + L' s_i'^2$   $f_i$     $g_i$
2         0.001562                0.0395    25.317
3         0.006595                0.0812    12.315
4         0.011680                0.1081    9.251
5         0.014850                0.1219    8.203
6         0.025700                0.1603    6.238
7         0.058500                0.2419    4.134
Total     0.118887

From equations 10.4.a and 10.4.b we then find

$V(Y) = 2728$
$V(Y') = 5220000$
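
The trial computation can be reproduced with a few lines of Python. This is a sketch under stated assumptions: equations 10.4.a to 10.4.c are not reproduced in this section, so the forms used here, $f_i = \sqrt{L s_i^2 + L' s_i'^2}$ (all $c_i = 1$), $g_i = 1/f_i$, $V(Y) = \Sigma N_i s_i^2 (g_i - 1)$ and $V(Y') = \Sigma N_i s_i'^2 (g_i - 1)$, are inferred from the tabulated figures. The function name `trial_variances` is hypothetical.

```python
import math

# Data from Table 10.4.a (strata 1 to 7).
N = [440, 520, 360, 520, 400, 210, 50]
S2_FARMS = [0.0, 0.0384, 0.16, 0.24, 0.16, 0.09, 0.0]
S2_ACRES = [0.0, 2.0, 15.0, 160.0, 650.0, 1700.0, 4500.0]


def trial_variances(L, L_prime):
    """Return (V(Y), V(Y')) for trial values L and L' (assumed forms, c_i = 1)."""
    v_farms = 0.0
    v_acres = 0.0
    for n, s2, s2p in zip(N, S2_FARMS, S2_ACRES):
        f_squared = L * s2 + L_prime * s2p
        if f_squared == 0.0:
            continue  # both variances are zero, so the stratum contributes nothing
        g = 1.0 / math.sqrt(f_squared)  # raising factor g_i = 1 / f_i
        v_farms += n * s2 * (g - 1.0)
        v_acres += n * s2p * (g - 1.0)
    return v_farms, v_acres


for label, (L, Lp) in {"a": (0.04, 0.000013), "b": (0.06, 0.000013)}.items():
    vy, vyp = trial_variances(L, Lp)
    print(label, round(vy), round(vyp))
```

The results agree with the printed values to within about 0.1 per cent; the small differences arise because the text works with slide-rule values of $g_i$.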

Both of these values are too high. Further trial values (b) $L = 0.06$,
$L' = 0.000013$ and (c) $L = 0.06$, $L' = 0.000010$ were therefore taken. The
values of $V(Y)$ and $V(Y')$ obtained were

(b) $L = 0.06$, $L' = 0.000013$      (c) $L = 0.06$, $L' = 0.000010$
    $V(Y) = 2274$                        $V(Y) = 2332$
    $V(Y') = 4812000$                    $V(Y') = 5301000$

If three points A, B and C with coordinates corresponding to the values
(a), (b) and (c) of $L$ and $L'$ are plotted, the points P and Q on the lines AB and
BC where $V(Y)$ is estimated to have the required value 2322 can be determined
by linear interpolation. The line PQ then represents an approximation to the
curve of values of $L$ and $L'$ for which $V(Y)$ has the required value. A similar
line can be drawn for $V(Y')$. These two lines are found to intersect at the point
$L = 0.059$, $L' = 0.0000120$. A check computation of the values of $V(Y)$ and
$V(Y')$ for these values of $L$ and $L'$ gives

$V(Y) = 2310$      $V(Y') = 4972000$

Hence $L$ and $L'$ have been determined with all necessary accuracy. The
resultant optimal values of $f_i$ are included in Table 10.4.b.

The alternative method of solution using $\lambda$ and $\lambda'$ gives similar results and
is left as an exercise to the reader.

It is worth noting that all three sets of trial values give sampling fractions
which, after adjustment so that neither variance exceeds its required value,
are reasonably efficient. In order to compare the efficiencies the sampling
fractions given by the chosen values of $L$ and $L'$ must be adjusted by
multiplication by $\sqrt{\theta}$, calculated as explained above. After these adjustments the
numbers of farms in the samples and the relative efficiencies are found to be:

           $\lambda'$       No. of farms   Relative efficiency (per cent)
(a)        0.000325   231.7          96.4
(b)        0.000217   223.9          99.8
(c)        0.000167   230.5          96.9
Optimal    0.000203   223.4          100
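
The adjustment by $\sqrt{\theta}$ and the resulting sample sizes can be sketched as follows. The code is illustrative and rests on assumptions inferred from the figures of the example rather than quoted from equations 10.4.a to 10.4.f: all $c_i = 1$, $f_i = \sqrt{L s_i^2 + L' s_i'^2}$, and $\sqrt{\theta}$ taken as the larger of $\Sigma (N_i s_i^2 / f_i) / (V + \Sigma N_i s_i^2)$ over the two variates, where $V$ is the required variance. The function name `adjusted_sample_size` is hypothetical, and no check is made that the adjusted fractions stay below unity.

```python
import math

# Data from Table 10.4.a (strata 1 to 7) and the required variances.
N = [440, 520, 360, 520, 400, 210, 50]
S2_FARMS = [0.0, 0.0384, 0.16, 0.24, 0.16, 0.09, 0.0]
S2_ACRES = [0.0, 2.0, 15.0, 160.0, 650.0, 1700.0, 4500.0]
V_REQ_FARMS = 2322.0
V_REQ_ACRES = 4990000.0


def adjusted_sample_size(L, L_prime):
    """Return (sqrt(theta), adjusted number of farms) for trial L and L'."""
    f = [math.sqrt(L * a + L_prime * b) for a, b in zip(S2_FARMS, S2_ACRES)]

    def root_theta(s2, v_required):
        # sum(N s^2 g) with g = 1/f, over strata that are actually sampled
        raised = sum(n * v / fi for n, v, fi in zip(N, s2, f) if fi > 0)
        return raised / (v_required + sum(n * v for n, v in zip(N, s2)))

    rt = max(root_theta(S2_FARMS, V_REQ_FARMS),
             root_theta(S2_ACRES, V_REQ_ACRES))
    n_farms = rt * sum(n * fi for n, fi in zip(N, f))
    return rt, n_farms


trials = {"a": (0.04, 0.000013), "b": (0.06, 0.000013),
          "c": (0.06, 0.000010), "optimal": (0.059, 0.000012)}
results = {k: adjusted_sample_size(*v) for k, v in trials.items()}
n_opt = results["optimal"][1]
for label, (rt, n_farms) in results.items():
    print(label, round(rt, 4), round(n_farms, 1), round(100 * n_opt / n_farms, 1))
```

Relative efficiency is computed here as the ratio of the optimal sample size to the adjusted sample size of the trial, which reproduces the order of magnitude of the tabulated percentages.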

10.5 The double-ratio estimate 

The ratio method is applicable to populations in which the ratio of two
variates, $y$ and $x$, is less subject to variation than either variate separately.
The method can be extended to the case in which there are four variates $y_1$, $x_1$,
$y_2$, $x_2$, such that the ratio of the ratios $y_1/x_1$ and $y_2/x_2$ is less subject to variation
than either ratio separately. This extension is due to N. Keyfitz * and
has been applied by him to the estimation of the total labour force, wages,
salaries, materials used, etc., in the case in which there is an initial complete
census of production (denoted here by $x_1$), and of the labour force (denoted
by $y_1$), etc., and subsequently a further complete census of production, $x_2$,
but a sample only for labour force, $y_2$, etc. In this case it may be anticipated
that the production per worker in a factory, though it may vary from factory to
factory, and from period to period in the same factory, will increase or decrease
in much the same ratio from period to period for all factories.

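The required estimate of the total labour force on the second occasion is,
in the case of a random sample,

$$Y_2 = X_2 \, \frac{Y_1}{X_1} \cdot \frac{S(y_2)\, S(x_1)}{S(x_2)\, S(y_1)}$$

where $X_1$, $Y_1$ and $X_2$ are the census totals and $S$ denotes summation over the sample.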
The variance of $Y_2$ is given by the approximate formula

[formula and accompanying definition illegible in the source]

* I am indebted to Dr. Keyfitz for permission to publish an account of this method, 
which he first described in a lecture given at the London School of Economics. 
The parallelism with the formulas for the single-ratio estimate already given
will be apparent. The extension to the case of stratified sampling is similar
to that followed in the case of the single-ratio estimate.

In the case of the labour-force estimate Keyfitz found that the squares of
the coefficients of variation, i.e. (S.E. of estimate/estimate)$^2$, for the
double-ratio estimate and the two possible single-ratio estimates were as follows:

Double-ratio estimate ............................ $c^2 = 0.0012$
Single-ratio estimate, $X_2 S(y_2)/S(x_2)$ ....... $c^2 = 0.040$
Single-ratio estimate, $Y_1 S(y_2)/S(y_1)$ ....... $c^2 = 0.303$

The double-ratio estimate is therefore in this case much more accurate than
either of the single-ratio estimates.

10.6 Relative precision of biased and unbiased estimates 

The use of biased estimates introduces errors (due to the bias) which may 
be large relative to the random sampling errors. Nevertheless we often have 
good grounds for believing, either on the basis of previous experience of the 
material under consideration, or from detailed statistical analysis, that the 
errors due to bias are actually small, particularly when comparisons between 
different domains of study are required. If, therefore, a large reduction in the 
random sampling error is effected by the use of a biased estimate this may be 
preferable to the unbiased estimate. The relative precision of biased and 
unbiased estimates (i.e. the reciprocal of the ratio of their respective sampling 
variances) can be determined by estimating the variances of the two types of 
estimate by the methods appropriate to the estimates in question. In the
particular but important case in which the unbiased estimate consists of a
weighted mean of the observed values $z$ with weights $w$ and the sources of
variation are such that all $z$ can be regarded as subject to the same variance,
the unweighted mean of the values will provide an estimate of the mean which
has minimum variance. The ratio of the variances of the two estimates will
then be, from formula 7.5.e,

$$\frac{1}{n} \Big/ \frac{S(w^2)}{\{S(w)\}^2} = \frac{(\text{mean of } w)^2}{\text{mean of } w^2}$$

Example 10.6 

An estimate of rabbit damage to the wheat crop of a county is made on a
random sample of farms. On each selected farm one of the fields growing
wheat is selected with probability proportional to the area of the field. The
damage is estimated by comparing fenced and unfenced areas. If the distribution
of wheat acreages is that given in Table 7.2 what is the relative precision
of the weighted and unweighted estimates?

In this case the components of variation are such that the unweighted
mean of the losses per acre on all sampled fields is likely to provide the estimate
with approximately minimum error variance. This estimate will, however,
be biased if the loss per acre is correlated with the area of wheat on the farm.
An unbiased estimate will be provided by a weighted mean of the losses per
acre, the weights being proportional to the areas of wheat on the sampled
farms. The farms not growing wheat must be excluded from the calculation,
since they are automatically excluded from both the weighted and unweighted
means, even if they are included in the original sample. From the results
given in Example 7.2.b we have

$n' = 45$, $S(y) = 2301$, $S(y^2) = 207,261$

where $n'$ is the number of farms growing wheat. Here $w = y$. Hence the
relative precision is

$$\frac{1}{45} \Big/ \frac{207{,}261}{2301^2} = 0.568$$
The biased estimate has therefore nearly twice the precision of the unbiased
estimate. If it were possible to select farms with probability proportional to
area of wheat grown (one field, selected with probability proportional to area,
being sampled on each selected farm), the unweighted mean would provide
the unbiased estimate. The reciprocal of the above fraction, 1.76, therefore
gives the advantage, when an unbiased estimate is required, of using this
method of sampling, i.e. of giving each acre an equal chance of being selected
for assessment of loss instead of taking a random sample of farms. A method
of selecting a sample which approximates to this requirement is described in
Section 10.8. It may be noted here that the relative precision of the two
estimates is very similar to that found in Example 7.17 for the rather similar
case of the rate of application of fertilizers. In that case the value was
1/1.91 = 0.524.
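
The arithmetic of this example is short enough to check directly. The Python sketch below evaluates the ratio (mean of $w$)$^2$ / (mean of $w^2$) from the three sample quantities quoted in the example; the function name `relative_precision` is hypothetical and the code is meant only as an illustration of the formula of Section 10.6.

```python
def relative_precision(n, sum_w, sum_w_squared):
    """Precision of the weighted mean relative to the unweighted mean.

    Equals (1/n) / (S(w^2) / S(w)^2), i.e. (mean of w)^2 / (mean of w^2),
    under the assumption that all observations have the same variance.
    """
    mean_w = sum_w / n
    mean_w_squared = sum_w_squared / n
    return mean_w ** 2 / mean_w_squared


# Figures of Example 10.6: n' = 45, S(y) = 2301, S(y^2) = 207,261, with w = y.
precision = relative_precision(45, 2301, 207261)
print(round(precision, 3), round(1 / precision, 2))
```

The second printed value is the reciprocal, i.e. the gain from sampling with probability proportional to the area of wheat when an unbiased estimate is required.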

10.7 The sampling error of an estimate of bias 

In Section 7.23 it was pointed out that an estimate of the bias arising from
the use of a biased method of estimation could be obtained by comparing the
biased with the unbiased estimates for relevant subdivisions of the sample.
In the common case in which the biased estimate under consideration is the
unweighted mean of some variate $z$, whereas the unbiased estimate is provided
by some form of weighted mean of $z$, an alternative approach to the problem
is provided by the calculation of the regression of $z$ on the weights $w$. If $B$
denotes the estimated bias and $b$ the regression coefficient of $z$ on $w$ we have

$$B = \bar{z} - \bar{z}_w = \frac{Sz}{n} - \frac{Swz}{Sw} = -\frac{S(w - \bar{w})(z - \bar{z})}{Sw}$$

$$b = \frac{S(w - \bar{w})(z - \bar{z})}{S(w - \bar{w})^2}$$

Hence

$$B = -\frac{b\, S(w - \bar{w})^2}{Sw}$$



Thus for a given distribution of weights the magnitude of the bias is pro- 
portional to b. In particular, if the value of b for the population is zero there 
will be no bias, and if b is the same for different domains of study which have 
similar distributions of the weights the comparisons between the unweighted 
means for the various domains will be free from bias. The contribution to the
sampling error of $B$ due to error in estimation of $b$ can be obtained by calculating
the error variance of $b$ from the formula given in Section 7.12. There will be
a further contribution due to variation in the distribution of $w$ from sample
to sample, but this can ordinarily be neglected.

It should be noted that the calculation of $b$ does not give a more accurate
estimate of the bias of the unweighted mean than that provided by the difference
of the weighted and unweighted means. The two estimates are in fact identical.
The calculation of the standard error of $B$ does, however, provide an indication
of the probable limits of the bias.

Example 10.7 

Use the above method to assess the evidence for bias in the estimate of
the dressing of nitrogen per acre derived from the unweighted mean over all
fields of Example 6.19.

In the notation of Example 6.19, $w = g' g'' x$ and $z = r = y/x$. The
quantities $Sw$, $Sz$, $Sw^2$, $Swz$, $Sz^2$ can best be calculated for each size-group
separately, before introduction (where necessary) of the factors $g'$ and $g''$.
$Sw$ and $Swz$ have already been given in the formula for $r$. $Sz$ is given in Table
6.19.c. The results for the whole sample are:

$n = 67$      $Sw = 58,229$      $Sz = 29.36$

$S(w - \bar{w})^2 = 44,849,800$      $S(w - \bar{w})(z - \bar{z}) = 2996$      $S(z - \bar{z})^2 = 3.4822$

$b = 2996/44,849,800 = +0.000,066,80$
$B = -0.000,066,80 \times 44,849,800/58,229 = -0.0515$

This agrees with the difference $0.438 - 0.490 = -0.052$ of the unweighted
and weighted means. Following the method given in Section 7.12 and
illustrated in Example 7.12.a we find

S.E. $(b) = 0.000,033,6$

Hence

S.E. $(B) = 0.0258$

The actual value of $B$ is almost double its standard error. There is therefore
some evidence that the unweighted mean is biased, but the magnitude of the
bias is not accurately determined. The limits of error given by plus or minus
twice the standard error are $+0.0001$ and $-0.1031$.
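
The figures of this example can be reproduced from the quoted sums of squares and products. The Python sketch below is illustrative; it assumes that the standard error of $b$ referred to in the text is the ordinary least-squares value, $\sqrt{\text{residual mean square} / S(w - \bar{w})^2}$ with $n - 2$ degrees of freedom, since the formula of Section 7.12 is not reproduced here. All variable names are hypothetical.

```python
import math

# Quantities quoted in Example 10.7.
n = 67
sum_w = 58229.0
ss_w = 44849800.0   # S(w - w_bar)^2
sp_wz = 2996.0      # S(w - w_bar)(z - z_bar)
ss_z = 3.4822       # S(z - z_bar)^2

# Regression coefficient of z on w, and the bias of the unweighted mean.
b = sp_wz / ss_w
bias = -b * ss_w / sum_w

# Assumed ordinary regression standard error of b (n - 2 degrees of freedom).
residual_ms = (ss_z - b * sp_wz) / (n - 2)
se_b = math.sqrt(residual_ms / ss_w)

# B is a fixed multiple of b for a given set of weights, so its S.E. scales likewise.
se_bias = se_b * ss_w / sum_w

print(f"b = {b:.10f}, B = {bias:.4f}")
print(f"S.E.(b) = {se_b:.10f}, S.E.(B) = {se_bias:.4f}")
```

The contribution from sample-to-sample variation in the distribution of $w$ is ignored, as in the text.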
It should be noted that the amount of bias obtained in sampling a population
depends not only on the properties of the population, but also on the sampling
method followed. Changes in the sampling method will consequently affect the
bias. In this example, for instance, the weights are affected by the first-stage
sampling fractions. If sufficiently extensive data were available, the biases to
be expected with different sampling fractions could be estimated by adjusting
the contributions of the different strata to $Sw$, $Sz$, $Sw^2$, $Swz$, $Sz^2$ so as to
conform to the new sampling fractions.

10.8 A simple method of sampling with probability proportional to 
size 

The standard method of selecting units with probability proportional to
size is to form the running totals of the sizes of the successive units, i.e. the totals
$x_1$, $x_1 + x_2$, $x_1 + x_2 + x_3$, ..., and then to select the required number of
numbers at random in the range 1 to $X$, where $X$ is the total of all the $x$'s.
If a number which is greater than $x_1$ and less than or equal to $x_1 + x_2$ is selected,
for example, then the second unit is chosen.
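
The running-totals method translates directly into code. The Python sketch below is a minimal illustration; it assumes integer sizes and selection with replacement (the text does not say how repeated selections are to be treated), and the function names are hypothetical.

```python
import bisect
import itertools
import random


def unit_for_number(running_totals, number):
    """0-based index of the unit whose running-total interval contains `number`.

    A number greater than x_1 and not exceeding x_1 + x_2 maps to the second unit.
    """
    return bisect.bisect_left(running_totals, number)


def select_pps_cumulative(sizes, k, rng=random):
    """Select k units (with replacement) with probability proportional to size."""
    running_totals = list(itertools.accumulate(sizes))
    total = running_totals[-1]  # X, the total of all the x's
    return [unit_for_number(running_totals, rng.randint(1, total))
            for _ in range(k)]


sizes = [3, 5, 2]  # illustrative sizes; running totals are 3, 8, 10
print(select_pps_cumulative(sizes, 4, random.Random(1)))
```

A unit of size zero occupies an empty interval and can never be selected.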

If the number of units from which selection has to be made is large the 
calculation of these running totals requires a considerable amount of labour, 
particularly if a printer adder is not available. In some circumstances, indeed, 
this labour is sufficiently great, relative to the other work of the survey, to rule 
out entirely the use of sampling with probability proportional to size. 

A simple and ingenious method of overcoming this difficulty has been
devised by D. B. Lahiri. If there are $N$ units, and $M$ is a number greater than
or equal to the largest $x$ (the $x$'s being expressed in suitable units), then for
each unit that has to be selected two numbers $\mu$ and $\nu$ are selected at random
from the ranges 1 to $M$ for $\mu$ and 1 to $N$ for $\nu$. If the size of the $\nu$th unit is
greater than or equal to $\mu$ the unit is selected. If it is less than $\mu$ the unit is
rejected and a further pair of random numbers is chosen, the process being
repeated until a unit satisfying the condition is obtained.
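
Lahiri's procedure as described above can be sketched in a few lines of Python. The sketch assumes integer sizes; the function name `lahiri_select` and the returned rejection count are illustrative additions, not part of the original description.

```python
import random


def lahiri_select(sizes, upper_bound, rng=random):
    """Select one unit with probability proportional to size by Lahiri's method.

    `upper_bound` is M, any number not less than the largest size.
    Returns (0-based index of the selected unit, number of rejected pairs).
    """
    if max(sizes) <= 0:
        raise ValueError("at least one unit must have positive size")
    if upper_bound < max(sizes):
        raise ValueError("upper_bound must be at least the largest size")
    n_units = len(sizes)
    rejections = 0
    while True:
        nu = rng.randint(1, n_units)      # candidate unit, 1 to N
        mu = rng.randint(1, upper_bound)  # test number, 1 to M
        if sizes[nu - 1] >= mu:           # retain the unit if its size >= mu
            return nu - 1, rejections
        rejections += 1


rng = random.Random(0)
draws = [lahiri_select([1, 9], 9, rng)[0] for _ in range(20000)]
print(draws.count(1) / len(draws))  # close to 9/10
```

No running totals are needed; the cost is the repeated drawing of pairs, which grows as $M$ is taken larger relative to the typical size.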

There is no need to determine the largest x exactly, though if an excessively 
high value of M is taken this will lead to an unnecessarily large number of 
rejections. If there are relatively few large values of x the number of rejections 
will in any case be large, but there is no way of avoiding this (other than the 
formation of the cumulative totals) unless the large values can be picked out 
and segregated in one or more separate strata. When the size range is large 
such stratification is in any case frequently advisable for other reasons. 

In two-stage sampling in which the second-stage units require to be selected 
with equal probability the difficulty of a large size-range can often be overcome 
by selecting two or more second-stage units from very large first-stage units. 

Example 10.8

Devise a method of selecting a sample of wheat fields with probability 
approximately proportional to size which can be used in the rabbit damage 
survey discussed in Example 10.6. 
If no records of the acreages of the crops are available farm by farm at the
time the crops are sown, sampling with probability exactly proportional to area 
of crop is impossible. The acreages sown in the preceding year are, however, 
given in the June 4th returns, and the changes from year to year in the more 
important crops on a farm are likely to be small. If, therefore, farms are selected 
with probability proportional to acreage of wheat in the preceding year, and from 
each selected farm one field of wheat is selected from all the fields growing 
wheat with probability proportional to the acreage of these fields, the process 
of selection will give selection with probability proportional to the areas of the 
wheat fields in the district covered, except in so far as the areas of wheat grown 
on the farms in the year of the survey differ from those grown on the same farms 
in the previous year. 

In a survey of rabbit damage covering a large area it would be impracticable 
to form running totals of the wheat acreages of all the farms. Lahiri's method, 
therefore, provides a useful alternative. Owing to the relatively small number 
of farms growing a large area of wheat the number of rejections will be somewhat 
large, even if more than one field is taken from the farms with very large areas 
of wheat. The number of rejections may be estimated if the approximate
distribution of the wheat acreages is known. If the distribution of wheat acreages
is that given in Table 7.2, for example, and the value of $M$ chosen is 80, the
chance of retaining a selected farm with acreage of wheat $x$ ($x < 80$) is $x/80$.
If all farms in the second size-group had 5 acres of wheat, all those in the third
15 acres, etc., the average number retained when all the farms of the table are
included in the selection process will be



Since, however, farms with no wheat in the previous year may carry wheat
in the current year some of these should also be selected. A reasonable rule
will be to treat all farms which carry less than 5 acres of wheat as if they were
carrying 5 acres. The chance of retaining a selected zero farm will then be
$\frac{1}{16}$, so that the number retained will be approximately 30 out of 125 or about
24 per cent. Not all of these will be found to carry wheat in the current year.
If $x \geq 80$ the farm will always be retained. If $x$ lies between 80 and 160
one field will be taken if $x - 80 < \mu$, and two fields if $x - 80 \geq \mu$. Similarly
two or three fields will be taken if $x$ lies between 160 and 240, and so on. The
expected number of farms on which two or more fields will be retained can be
calculated as above. The expected number with two or more fields will be

$$3 \times \frac{5}{80} + 1 \times \frac{15}{80} + 3 \times \frac{25}{80} + 1 \times \frac{35}{80} + 1 = 2.75$$

and of these the one with 260 acres will have 3 or 4 fields.

For the same number of sampled fields the selection of the fields with
probability approximately proportional to area may be expected to give a gain
in precision of the same order as that found for the unweighted estimate of
Example 10.6. The small differences in weighting which will result from
lack of knowledge of the current year's wheat acreages are not likely to increase 
the variance appreciably. Indeed these weights are not likely to be sufficiently 
associated with the amount of damage to make the unweighted mean appre- 
ciably biased. On the other hand, if a random selection of farms with equal 
probability were made, the use of an unweighted mean might give a seriously 
biased estimate of damage since the degree of damage may be associated with 
size of farm. 

The relative precision calculated in Example 10.6 does not take into account 
the sampling of the farms that grew no wheat in the preceding year. An 
accurate estimate of the relative precision can be made when results are available 
from an actual survey, or from data which give the areas of wheat on the same 
farms in two consecutive years. 

10.9 Estimation of error in two-stage sampling with uniform overall
sampling fraction and selection at the first stage from within
strata with probability proportional to number of second-
stage units

This important special case has been discussed in Sections 3.10 and 8.13.
The formulae for the variance given in Section 8.13 can be stated in a somewhat
more elegant form (see Gray and Corlett, 1950, D'). Instead of considering
the variability of the values of $r_{ij}$ between the different first-stage units of a
stratum we may consider the variability of the corresponding totals of $y$ for
the different first-stage units. If the sample total of $y$ for the $j$th unit (presumed
selected) of the $i$th stratum is denoted by $Y_{ij}$ ($= S_{ij}(y)$), the estimate of the
variance of these totals within stratum $i$ will be given by



Since there will usually only be very few (often two) first-stage units per stratum,
some form of pooled estimate will be required, derived from the estimates of
the various strata. This can be denoted by $s_t'^2$. The formula for the variance
then becomes

$$V(Y) = g^2 \{ s_t'^2 \, S\, n_i' (1 - f_i') + S\, n_i s_i''^2 f_i' (1 - f_i'') \}$$

where $n_i$ is the total number of selected second-stage units in the $i$th stratum.
This formula is exactly similar in form to that for $V(Y)$ in a two-stage sample
with equal probability of selection at the first stage. This latter formula can
be easily derived from the formula for $V(\bar{y})$ given in Section 1.17.

If the number of second-stage units in each first-stage unit is taken as the
measure of size, the same number, $n''$ ($= f N_i / n_i'$), of second-stage units will
require to be selected from each first-stage unit in a given stratum, and
$r_{ij} = \bar{y}_{ij}$. In this case we can of course work equally well with the totals
$Y_{ij}$ or the means $\bar{y}_{ij}$. If, however, the preliminary data on the number of
second-stage units in each first-stage unit are not exact, as for example may
occur when the second-stage frame for the selected first-stage units is constructed
after selection of the first-stage units, the formula given above, based on totals,
still holds. An actual case in which this contingency occurred is described in
Section 4.16.

10.10 Relative efficiency of sampling with probability proportional 
to size and stratified sampling with variable sampling 
fraction 

The relative efficiency of sampling with uniform probability and with 
probability proportional to size of unit has been discussed in Section 8.9. A 
further question that arises is how sampling with probability proportional to 
size compares in efficiency with stratification by size and the use of a variable 
sampling fraction. 

For any particular type of material this question can be dealt with by 
comparing the variances of the two types of sample, calculated by the methods 
already described in Chapter 8. It is worth noting, however, that the analysis 
of variance procedure provides a rapid and elegant approximate method of 
making this comparison. 

If we have a random sample taken with probability proportional to size $x$
and we stratify this sample (after selection) into size-groups of $x$, we can perform
an analysis of variance between and within size-groups on the values of $r$
obtained from the sample. This can be arranged as in Table 10.10.a, where
$V_w(r)$ represents the pooled estimate of the variance of $r$ within size-groups.

TABLE 10.10.a ANALYSIS OF VARIANCE OF $r$

                       Degrees of freedom   Mean square
Between size-groups    $t - 1$
Within size-groups     $S(n_i - 1)$         $V_w(r)$
Total                  $n - 1$              $V(r)$

If the ranges of the size-groups are sufficiently small for the variation in
size of $x$ within size-groups to be neglected, the sum of squares of $r$ within
size-groups can be taken as equal to $S(n_i - 1) s_i^2/\bar{x}_i^2$. If in addition there is no
great variation in the $V_i(r)$ for the different size-groups, or if all $n_i$ are large,
we have approximately

$$V_w(r) = \frac{1}{n} S\left( n_i s_i^2 / \bar{x}_i^2 \right) \qquad (10.10)$$
If the number of units in the $i$th stratum of the stratified sample is taken
as $n_i = n X_i/X$ the total number of units in the two samples will be equal,
and the sampling fractions will be proportional to $\bar{x}_i$, and therefore about
optimal. In this case $f_i = n_i/N_i = n \bar{x}_i/X$, and from formula 7.6.a

$$V(Y) = N^2 V(\bar{y}) = \frac{X^2}{n^2} \, S\, n_i s_i^2 (1 - f_i)/\bar{x}_i^2$$

From equation 10.10 this approximates to $X^2 V_w(r)/n$, apart from the factors
$(1 - f_i)$.
For the sample with probability proportional to size $V(Y) = X^2 V(r)/n$
(Section 7.15). Consequently the ratio of the mean squares $V(r)/V_w(r)$ in the
analysis of variance gives an estimate of the efficiency of stratified sampling
relative to sampling with probability proportional to size. This estimate is
approximate because (a) it has been assumed that the variation in $x$ within a
size-group can be neglected, (b) corrections for finite sampling, $1 - f_i$, have
been omitted, and (c) the $n_i - 1$ have been replaced by $n_i$. Allowance for
(a) will cause a decrease in the estimate, allowance for (b) will cause an increase,
while (c) is not likely to be important. Some further gain may be expected in
stratified sampling by taking optimal sampling fractions instead of fractions
proportional to mean size.

The inverse of the process can also be used, starting with the data of a
stratified sample with variable or uniform sampling fraction, or with the data
of a random sample stratified after selection. This process is illustrated in the
example below. Apart from saving computation it has the merit that reference
to the original data for the calculation of values of $r$ (which would be necessary
if the formula for $s_r^2$ of Section 8.9 were used) is not necessary.

Example 10.10 

Using the data on Hertfordshire farms described in Section 3.7, etc., and 
the within-size-group variances of Table 10. 4. a, compare the relative efficiency 
of sampling from within size-groups with sampling fractions proportional to 
the mean farm acreages of the size-groups, and unstratified sampling with 
probability proportional to farm acreage. 

As before, we may take $x$ to represent farm acreage (acres crops and grass),
and $y$ to represent wheat acreage. We shall require rough estimates of $\bar{x}_i$
and $\bar{y}_i$ for all size-groups. For $\bar{x}_i$ the size-group means were assumed to be
situated at one-third the group interval from the lower limit of the group.
For $\bar{y}_i$ weighted means were calculated from Tables 6.5.b, 6.6.b and 6.7.b.

TABLE 10.10.b HERTFORDSHIRE FARM DATA: CALCULATIONS FOR THE
CONSTRUCTION OF AN ANALYSIS OF VARIANCE OF $r$

Size-group   $\bar{x}_i$   $\bar{y}_i$   $r_i$          $s_i^2$   $s_i^2/\bar{x}_i^2$   $n_i$    $n_i r_i$
(1)          (2)           (3)           (4)            (5)       (6)                   (7)      (8)
6-           10            0.2           0.02           2         0.02                  5.2      0.1040
21-          30            1.7           0.0567         15        0.01667               10.8     0.6124
51-          85            11.0          0.1294         160       0.02215               44.2     5.7195
151-         200           31            0.155          650       0.01625               80       12.4
301-         365           77            0.2110         1700      0.01276               76.6     16.1626
601-         600           172           0.2867         4500      0.0125                30       8.6010
Total                                    $r = 0.17666$                                  246.8    43.5995

The values obtained are shown in Table 10.10.b. Taking sampling fractions
of $\bar{x}_i/1000$ we obtain from Table 10.4.a the values of $n_i$ shown. The rest
of the calculations in the table are self-explanatory ($r = S n_i r_i / S n_i$).

From columns 4 and 8 we find for the sum of squares of $r$ between
size-groups

$S n_i r_i^2 - r \, S n_i r_i = 0.8728$

and from columns 6 and 7 we find for the estimated sum of squares of $r$ within
size-groups

$S (n_i - 1) s_i^2/\bar{x}_i^2 = 3.8152$.

The reconstructed analysis of variance of $r$ therefore takes the form shown
in Table 10.10.c.

TABLE 10.10.c RECONSTRUCTED ANALYSIS OF VARIANCE OF $r$ WHERE
SAMPLING IS WITH PROBABILITY PROPORTIONAL TO SIZE

                       Degrees of freedom   Sum of squares   Mean square
Between size-groups    5                    0.8728
Within size-groups     240.8                3.8152           0.01584
Total                  245.8                4.6880           0.01907

The approximate relative efficiency of the two methods of sampling is
therefore $0.01907/0.01584 = 1.20$. Owing to variation of $x$ within size-groups
this is a slight overestimate when all sampling fractions are small, as explained
above. The correction to the total mean square from this cause is in fact of
the order of $-0.0015$. Unless all sampling fractions are small, however,
corrections for finite sampling are required. This will increase the efficiency
of the stratified sampling, as will adjustment of the sampling fractions to their
optimal values.
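
The reconstruction of the analysis of variance can be followed step by step in code. The Python sketch below recomputes Tables 10.10.b and 10.10.c from the size-group figures quoted in the example; it is illustrative only, the variable names are hypothetical, and because it carries unrounded intermediate values its results differ slightly in the last digits from the printed ones.

```python
# Size-groups 6- to 601- (stratum 1 carries no wheat and is omitted).
N = [520, 360, 520, 400, 210, 50]                  # farms per size-group
x_bar = [10.0, 30.0, 85.0, 200.0, 365.0, 600.0]    # mean farm acreage
y_bar = [0.2, 1.7, 11.0, 31.0, 77.0, 172.0]        # mean wheat acreage
s2 = [2.0, 15.0, 160.0, 650.0, 1700.0, 4500.0]     # within-group variance of y

# Sampling fractions x_bar / 1000 give the stratum sample sizes n_i.
n = [Ni * x / 1000.0 for Ni, x in zip(N, x_bar)]
r = [y / x for y, x in zip(y_bar, x_bar)]

n_total = sum(n)
sum_nr = sum(ni * ri for ni, ri in zip(n, r))
r_mean = sum_nr / n_total

# Sums of squares of r between and within size-groups.
ss_between = sum(ni * ri * ri for ni, ri in zip(n, r)) - r_mean * sum_nr
ss_within = sum((ni - 1.0) * v / (x * x) for ni, v, x in zip(n, s2, x_bar))

df_between = len(n) - 1
df_within = n_total - len(n)
ms_within = ss_within / df_within
ms_total = (ss_between + ss_within) / (df_between + df_within)

# Approximate efficiency of stratified sampling relative to sampling
# with probability proportional to size.
efficiency = ms_total / ms_within
print(round(ss_between, 4), round(ss_within, 4), round(efficiency, 2))
```

No correction is applied for variation of $x$ within size-groups or for finite sampling, so the ratio is subject to the same qualifications as in the text.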

10.11 Choice of probability function in sampling with variable 
probability 

Sampling with probability proportional to size is one form of what may be
termed sampling with variable probability. The essence of such sampling is
that the probability of selection of the different individual units is made
proportional to some known quantitative characteristic of the units themselves.
The same general theory will apply whatever the measure adopted. In
two-stage sampling the number of second-stage units often provides a convenient
measure of the size of the first-stage units (Section 8.13), but in some cases
greater efficiency will be obtained if the probabilities are taken proportional
to some function of the number of second-stage units, instead of proportional
to the actual number.

This question has been discussed by Hansen and Hurwitz (1949, A').
They consider the case of two-stage sampling with uniform overall probability
(self-weighting) and a cost function of the form

$C = c_1 n_1 + c_2 n_2 + c_3 n_3$
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where n^ is the number of selected first-stage (primary) units, # a is the expected 
number of second-stage units included in the selected first-stage units, and n s 
is the number of second-stage units included in the sample. The first and 
third terms of this cost function correspond to the terms of the cost function 
of Section 8.17 (d). The second term arises when a frame has to be con- 
structed for each selected first-stage unit, e.g. by listing or ground survey, and 
there are other costs, e.g. of travel, which are proportional to the total number 
of second-stage units in the selected first-stage units. Taking an illustrative 
example from the 1940 Census of Housing, Hansen and Hurwitz tabulate the 
relative efficiencies of selecting first-stage units with (a) equal probability, 
(b) probability proportional to the square root of the number of second-stage 
units, and (c) probability proportional to the number of second-stage units, 
for various ratios between $c_1$, $c_2$ and $c_3$. For most of the chosen ratios, selection 
with probability proportional to the square root of the number of second-stage 
units is most efficient. Even when $c_2$ is zero, method (b) is only slightly less 
efficient than method (c). 

10.12 A simple relation between relative precision and relative 
efficiency 

It is worth noting that when two sampling methods are being compared 
for efficiency and one of them is random or stratified with uniform sampling 
fraction, there is a simple relation between the relative precision and relative 
efficiency of the two methods. Denote the two methods by $S_1$ and $S_2$, $S_1$ 
being random or stratified with uniform sampling fraction. Let the relative 
precision ($S_2/S_1$) of the two methods for a given sample size be $P$, and the 
relative efficiency when $S_1$ is of the given sample size be $E$. Then the relation 
in question is 

$$E = f_1 + (1 - f_1)P$$

where $f_1$ is the overall or average sampling fraction of $S_1$, i.e. the number of 
units in the sample divided by the number in the population. From this relation 
the relative efficiency can be obtained immediately from the relative precision 
and vice versa. The specification of the size of $S_1$ is necessary because both $P$ 
and $E$ will vary with variations in this size. 

The above formula depends for its derivation on the fact that for a random 
or stratified sample with uniform sampling fraction the variance of an estimate 
is of the form $A(1/n - 1/N)$, where $A$ is independent of $n$. It is therefore applicable 
to the case in which domains of study cut across strata, provided we admit the 
approximation used in Example 9.5. Consequently the relative efficiencies 
of the sampling methods of that example can be calculated directly from 
Table 9.5.b. 

Example 10.12 

Calculate the relative efficiencies of the sampling methods of Example 9.5 
from the relative precisions given in Table 9.5.b. 
We have $f_1$ = 614/3296 = 0.186. For total acreage of domain B, for 
example, the efficiency of a stratified sample with a uniform sampling fraction 
relative to the sample with variable sampling fraction is 0.186 + 0.814 × 0.65 
= 0.72. The precision of a random sample relative to the sample with variable 
sampling fraction is 0.65 × 0.71 = 0.46, and the corresponding relative 
efficiency is therefore 0.56. 
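
The conversion between relative precision and relative efficiency is easily mechanised. The following Python sketch reproduces the figures of Example 10.12; the function names are introduced here for illustration only and are not part of the original text.

```python
def relative_efficiency(precision, sampling_fraction):
    """E = f + (1 - f) P, the relation of Section 10.12."""
    return sampling_fraction + (1.0 - sampling_fraction) * precision

def relative_precision(efficiency, sampling_fraction):
    """Inverse relation: P = (E - f) / (1 - f)."""
    return (efficiency - sampling_fraction) / (1.0 - sampling_fraction)

f1 = 614 / 3296                      # overall sampling fraction of Example 9.5

# Stratified sample with uniform sampling fraction, relative precision 0.65.
e_stratified = relative_efficiency(0.65, f1)

# Random sample: relative precision 0.65 x 0.71.
e_random = relative_efficiency(0.65 * 0.71, f1)

print(round(f1, 3), round(e_stratified, 2), round(e_random, 2))
```

The inverse function recovers the relative precision when only the relative efficiency has been tabulated.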

10.13 Detection of gross errors: use of ratios and regressions 

We have in Sections 5.6 and 5.20 referred to the importance of eliminating 
and controlling errors in the recording and statistical abstraction of survey 
data. Some further discussion of this important subject may, however, not 
be out of place. 

It is frequently not realised how difficult it is even for skilled observers and 
recorders to make measurements and to record facts without from time to 
time making errors. Unless far more elaborate precautions than are usually 
customary or possible have been taken, it can be confidently predicted that any 
large body of recorded measurements and observations will contain a percentage 
of errors. Even when this percentage is small the errors themselves are fre- 
quently of such a nature and magnitude that if uncorrected they seriously 
detract from the value of the results. 

Fortunately, the data themselves often permit statistical checks between 
the various measurements and observations. Qualitative characteristics can 
be examined for inconsistencies, such as that a male has borne children, and 
improbabilities, such as that a woman of 20 is credited with 5 children. Highly 
correlated quantitative characteristics provide a mutual check on gross errors 
in either variate. Sometimes an exact relation exists such that the sum of a 
number of variates is equal to a further observed variate. 

In large-scale surveys the possibility of utilising internal checks of this 
nature should always be explored. Such checks provide a control of the quality 
of the recording and abstraction. At the same time, if properly applied they 
can in many cases effectively eliminate most of the gross errors and thereby 
substantially improve the value of the results. The amount and type of checking 
required depends very much on the nature of the survey, but there is a general 
tendency to underestimate the amount that is worth while, often with disastrous 
results : as errors are gradually brought to light in the course of the analysis 
it slowly becomes apparent that more thorough checking and complete recom- 
putation is the only possible course. 

For quantitative characteristics ratios and regressions provide a valuable 
method of checking the original data and eliminating gross errors. When 
two variates are highly correlated, values which are in no way exceptional for 
either variate separately stand out as clearly abnormal when the two variates 
are considered in conjunction. The simplest method, if the data are not too 
numerous, is to plot one variate against the other. The data of aberrant points 
can then be examined. 
If the error is in the recording, this will not be revealed by the data 
themselves. Sometimes a repetition of the doubtful observations can be made, but 
this will not often be possible. Sometimes the error is an obvious numerical 
one, such as a round 10 or 100, which can be corrected. Often, however, 
there will be no way of determining the correct value. In such cases the 
measurement will either have to be rejected, or an estimated value substituted. 

Rejection of the measurement is the best and simplest course if this does 
not entail too much loss of information on other associated measurements, or 
throw the sample out of balance. Incomplete data are, however, frequently 
very troublesome to handle statistically, and consequently rejection of the 
measurement often necessitates the rejection of the unit as a whole. Substitution 
of an estimated value provides a simple method of preserving the remainder 
of the data on the unit in question. Usually an estimate can be derived from 
one or more of the regressions, determined graphically. 

Sometimes there are independent records which can be referred to when a 
recorded value is in doubt. In a recent anthropometric survey (Healy, 1952, 
D'), for example, photographs of children were taken at the same time as 
measurements were made. All the recorded measurements were checked in 
pairs, by plotting, the measurements which were most highly correlated being 
chosen as members of the pairs. Certain approximate summation checks were 
also possible. Aberrant values were checked against the original photographs. 
It was found that practically all the queried values were in fact in error. 

In this survey the recorded data were already punched on Hollerith cards. 
The plotting was carried out semi-mechanically, by sorting and making a card 
count of the numbers in each cell of the relevant two-way table, the actual cell 
numbers being then entered by hand on a two-way diagram. A method has 
also been devised for making a complete plot on a Hollerith tabulator (King, 
1949, B'). 

In extensive work all plotting can be dispensed with, once the ratios or 
regressions and the limits of error about them have been defined. The task 
is simplest with a ratio, since all that is necessary is to calculate the individual 
values of the ratio and examine those falling outside the prescribed limits, or 
to arrange for some mechanical sorting which effects this without actually 
calculating the individual values. The same methods apply in principle with a 
regression, but require greater elaboration. 
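
A ratio check of the kind just described might be sketched in Python as follows. The function name, the pair of measurements, the limits and the data are all hypothetical and serve only to illustrate the procedure; in practice the limits would be fixed from a study of the material.

```python
def flag_by_ratio(records, lower, upper):
    """Return (index, ratio) for every record whose ratio y/x falls outside
    the prescribed limits [lower, upper].

    records is a list of (x, y) pairs of two highly correlated measurements.
    A record with x == 0 has no defined ratio and is always queried.
    """
    queried = []
    for index, (x, y) in enumerate(records):
        if x == 0:
            queried.append((index, None))
            continue
        ratio = y / x
        if ratio < lower or ratio > upper:
            queried.append((index, ratio))
    return queried

# Hypothetical (height, sitting height) pairs in centimetres; the third
# record contains a misplaced decimal point.
records = [(120.0, 63.0), (131.0, 68.5), (125.0, 6.55), (140.0, 73.0), (118.0, 62.0)]
queried = flag_by_ratio(records, lower=0.45, upper=0.60)
print(queried)
```

Only the queried records need then be referred back to the original documents or other independent evidence.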

Checks of this kind have been found, from bitter experience, to be necessary 
in all exact anthropometric work. Many of the errors that occur in this type 
of survey can be traced to incorrect scale readings or incorrect recording. 
Automatic recording devices, which would eliminate such errors entirely, 
clearly require to be developed for use in this and in other fields where extensive 
sets of exact measurements have to be made. 

If no external checks, such as photographs or original records, are available, 
the problem of the rejection or correction of suspected errors becomes much 
more difficult. The subject is well discussed by Quenouille in Associated 
Measurements. Many rules have been proposed for the rejection of observations 
from internal evidence, but they are of limited value, since they are based for 
the most part on the assumption that the underlying distribution is normal, 
an assumption which may be far from the truth. The best guides are probably 
knowledge of the material being handled, a study of the distributions involved, 
a reluctance to reject an excessive number of observations, and common sense. 
A further point to remember is that once rejection is resorted to, the estimates 
of variance become suspect, since it is always the extreme values that are 
rejected. On the other hand the existence of a few gross errors introduces 
an instability into the variances which may not exist in the actual material. 
For exact work, in fact, gross errors must be avoided at all costs. 

Another rejection problem of a rather different type arises when a unit 
known to be exceptional happens to occur in a sample. Thus in a recent 
agricultural survey of Jamaica a few small farms were obtained which were 
almost wholly devoted to coconuts, an exceptional crop for smallholders. 
With the variable sampling fraction used, the inclusion of these farms would 
have seriously distorted the district figures, and they were therefore omitted. 
A parallel example which occurred in the one per cent sample of the British 
Census is described in Section 10.16. In such cases rejection is the simplest 
course to adopt, but such rejections should always be reported, and if necessary 
included in the national estimates derived from the sample. Essentially what 
is being done is to redefine the population with certain exceptional units 
excluded. Since the number of these units in the population is unknown there 
will be slight uncertainty as to the exact number of units in the remainder of 
the population, but this is not likely to introduce any appreciable errors. 

10.14 Lattice sampling 

If we have a square area of side p, divided into $p^2$ unit squares, we can select 
a sample of p unit squares in such a manner that every row and every column 
of the large square contains one of the selected unit squares. Such sampling 
is a special type of double stratification without control of sub-strata (Section 
3.4). The rows and columns of the square can represent any two-way 
classification of the material in which there are equal numbers of classes in the two 
classifications and one unit in each sub-class. 

Similar schemes are possible for three- or more way classifications. With 
a three-way classification, for example, with $p^3$ units, a sample of p units can 
be selected such that one unit is taken from each class of each classification. 
Alternatively a sample of $p^2$ units can be selected in such a manner that there 
are p units in every class of each classification, the p units belonging to any one 
class of any classification being so selected that one unit falls in each of the 
p classes of the other two classifications. A sampling scheme of this last type 
will be defined by a Latin square of side p, that is, a square pattern of p letters 
in which each letter occurs once and once only in each row and each column. 
Table 10.14.a shows a 6 × 6 square. The rows, columns and letters of the 
square represent the three classifications. 
TABLE 10.14.a EXAMPLE OF A 6 × 6 LATIN SQUARE 

C B E A F D 

F D A E C B 

D F C B A E 

E A D C B F 

B C F D E A 

A E B F D C 

Some element of randomization must, of course, be introduced in the 
selection of actual samples. If an estimate of error is required the type of 
randomization is governed by the form of the estimate, and is described below. 
If no estimate of error is required a sample of p units from a square can be 
selected by selecting a unit from the first row at random, then selecting a unit 
from the p - 1 unoccupied columns of the second row at random, and so on. 
The selection of a sample of p units from a cube is similar. For the first two 
classifications a sample of p units is taken from a square, as above, and p letters 
are then allocated to these points to indicate the third classification. In the 
case of a sample of $p^2$ units from a cube, the rows, columns and letters of any 
available p × p square can be randomized among themselves. The details of 
this procedure are explained below. Examples of squares up to 12 × 12 are 
given in Statistical Tables, but if these are not available a randomization of one 
of the diagonal squares (i.e. a square with each letter on lines parallel to a 
diagonal) will suffice. This may be obtained directly by allocating the letters 
in the first column at random, and then allocating those in the first row (except 
the first) at random. The further rows may then be filled in by writing the 
letters in the same order as in the first row, beginning with the determined letter. 
No further randomization is required. 
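
The two selection procedures just described can be sketched in Python. This is an illustration under stated assumptions, not a prescribed routine: the function names are introduced here, rows and columns are numbered from 0, and a fixed seed is used only so that the sketch is repeatable.

```python
import random

def select_square_sample(p, rng=random):
    """Pick p cells of a p x p square, one in every row and every column.

    Taking the rows in turn and choosing at random among the columns not
    yet occupied is equivalent to drawing a random permutation of columns.
    """
    columns = list(range(p))
    rng.shuffle(columns)
    return [(row, columns[row]) for row in range(p)]

def random_diagonal_square(letters, rng=random):
    """Randomized diagonal (cyclic) Latin square built as described above."""
    p = len(letters)
    first_column = list(letters)
    rng.shuffle(first_column)              # letters of the first column at random
    rest = first_column[1:]
    rng.shuffle(rest)                      # rest of the first row at random
    first_row = [first_column[0]] + rest
    position = {letter: k for k, letter in enumerate(first_row)}
    # Every row repeats the order of the first row, beginning with the
    # letter already determined by the first column.
    return [[first_row[(position[lead] + j) % p] for j in range(p)]
            for lead in first_column]

def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and column."""
    p = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == p and set(row) == symbols for row in square)
    cols_ok = all({row[j] for row in square} == symbols for j in range(p))
    return len(symbols) == p and rows_ok and cols_ok

rng = random.Random(1)                     # fixed seed for repeatability only
sample = select_square_sample(6, rng)
square = random_diagonal_square("ABCDEF", rng)
for row in square:
    print(" ".join(row))
```

For a sample of p units from a cube, the cells returned by `select_square_sample` would each be given one of p letters, allocated at random, to indicate the third classification.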

Sampling of this kind, which may be termed lattice sampling, is of 
particular use when the material to be sampled is of a type that lends itself to 
multiple subdivision on a square or cubic pattern. One such type arises in 
sampling schemes which extend over both space and time. An example is 
provided by a sampling scheme for estimating the catches of fish landed at 
various ports along the coast of India. Catches are there landed at all hours 
of the day and night, the times of landing depending on the tide, weather, etc. 
On any one part of the coast the times of landing at the different ports are highly 
correlated. It was therefore proposed that a sampling scheme be adopted in 
which every port would be sampled every day, the times of day being so chosen 
that for a group of p neighbouring ports a different part of the day was covered 
at each port. Moreover the times of day were to be so rotated that over a period 
of p days all times of the day were covered at each port. This requires a p × p 
Latin square in which the rows, columns and letters represent ports, days and 
times of day. A more complicated example involving two-stage sampling is 
provided by proposals for road-traffic censuses outlined in the next section. 

Various two-dimensional schemes of the lattice type were suggested by 
Tepping et al (1943, D) under the name of deep stratification. Their performance 
was tested out on housing data in an American city. The sampling unit was a 
city block, rent and size (number of housing units) being taken as the two types 
of subdivision. Each rent-size classification contained a number of blocks, 
one block being selected at random from each chosen cell. If the number of 
blocks is to be the same in all cells the rent classification must be varied within 
the different size subdivisions, or vice versa. If this is not done, the simplicity 
of the scheme is sacrificed. This lessens the effectiveness of schemes of this 
type for sociological or economic material, particularly if the two classifications 
are highly correlated. 

More complicated schemes, in which different probabilities of selection 
are assigned to different patterns, have recently been discussed by Goodman 
and Kish (1950, A'). A scheme of the same general type appears to be used 
in the Canadian Labour Force Survey (Keyfitz and Robinson, 1949, F'). 

The problem of estimating the error of lattice samples must now be con- 
sidered. A general paper on the subject has been published by H. D. Patterson 
(1954, A"). Case (d) below is discussed by Hansen, Hurwitz and Madow in 
their book. 

(a) Square lattice 

No valid estimate of error is possible for a sample containing p units. With 
a sample of 2p units, however, a valid estimate is possible. The simplest 
procedure is to divide the lattice into p mutually exclusive sets of p units. Any 
Latin square effects such a subdivision, the letters defining the sets. These 
sets may themselves be regarded as complex sampling units. If two sets are 
selected at random from all p sets the contrasts between the sets will therefore 
provide an estimate of the sampling error with one degree of freedom. This 
estimate of error will not be of much value if only one square is sampled. If 
there are a number of squares each square will contribute one degree of freedom 
to the estimate of error. 

When p is even, however, there is an alternative procedure by which a 
sample of 2p units can be made to yield $\tfrac{1}{2}p$ degrees of freedom for error. We 
start with a basic pattern made up of 2 × 2 squares, such as that shown for 
an 8 × 8 square on the left side of Table 10.14.b. We then rearrange, first, 
the rows in random order amongst themselves, and secondly the columns in 
random order amongst themselves. If the two orders are 2 1 7 4 8 3 6 5 and 
1 3 5 2 4 6 7 8 the arrangement on the right of the table will be obtained. 

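The symbols may now be taken to indicate the values actually observed. 
An estimate of the error variance per unit with 4 degrees of freedom will then 
be provided by 

$$s^2 = \tfrac{1}{16}\{(a_1 - a_2 - a_3 + a_4)^2 + (b_1 - b_2 - b_3 + b_4)^2 + (c_1 - c_2 - c_3 + c_4)^2 + (d_1 - d_2 - d_3 + d_4)^2\}$$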
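TABLE 10.14.b 8 × 8 ROTATIONAL SCHEMES FOR TWO TYPES OF SUBDIVISION 

            Basic pattern                          Actual arrangement 
     1   2   3   4   5   6   7   8          1   2   3   4   5   6   7   8 
1   a1  a2   .   .   .   .   .   .     1   a3   .   .  a4   .   .   .   . 
2   a3  a4   .   .   .   .   .   .     2   a1   .   .  a2   .   .   .   . 
3    .   .  b1  b2   .   .   .   .     3    .   .   .   .   .   .  d1  d2 
4    .   .  b3  b4   .   .   .   .     4    .  b3   .   .  b4   .   .   . 
5    .   .   .   .  c1  c2   .   .     5    .   .   .   .   .   .  d3  d4 
6    .   .   .   .  c3  c4   .   .     6    .  b1   .   .  b2   .   .   . 
7    .   .   .   .   .   .  d1  d2     7    .   .  c3   .   .  c4   .   . 
8    .   .   .   .   .   .  d3  d4     8    .   .  c1   .   .  c2   .   . 

(A dot marks a unit not included in the sample.) 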



In the general case the divisor will be 2p. Allowing for finite sampling, the 
sampling variance of the mean will be $(1 - 2/p)s^2/2p$. With the specified 
randomization process it can be shown that this estimate of error is unbiased. 
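
The calculation may be sketched in Python as follows. The sketch assumes that the contrast formed for each 2 × 2 square is the interaction contrast (the sum of one diagonal less the sum of the other), which is the only contrast within the square free of row and column differences; the function name and the numerical values are hypothetical.

```python
def square_lattice_error(blocks, p):
    """Error estimate for a sample of 2p units arranged in p/2 squares of 2 x 2.

    blocks holds, for each 2 x 2 square of the basic pattern, the four
    observed values (v1, v2, v3, v4), v1 v2 being the upper row and
    v3 v4 the lower row. Returns (variance per unit, variance of the mean).
    """
    if p % 2 != 0 or len(blocks) != p // 2:
        raise ValueError("p must be even, with p/2 squares of four values")
    contrasts = [v1 - v2 - v3 + v4 for v1, v2, v3, v4 in blocks]
    s2 = sum(c * c for c in contrasts) / (2 * p)      # divisor 2p
    variance_of_mean = (1 - 2 / p) * s2 / (2 * p)     # finite sampling allowed for
    return s2, variance_of_mean

# Hypothetical observations for the four squares a, b, c, d of an 8 x 8 lattice.
blocks = [(2.0, 0.0, 0.0, 2.0)] * 4
s2, variance_of_mean = square_lattice_error(blocks, p=8)
print(s2, variance_of_mean)
```

Purely additive row and column differences leave every contrast, and therefore the estimate, equal to zero.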

(b) Cubic lattice, sample of 2p units 

In the case of a cubic lattice no unbiased estimate of error of the above 
type appears to be possible with a sample of 2p units. The best that can be 
done is to obtain one degree of freedom by selecting and contrasting two out 
of $p^2$ mutually exclusive sets of p units. Such a group of sets may be constructed 
as follows. Let $(a_1, a_2, \ldots, a_p)$, $(b_1, b_2, \ldots, b_p)$, $(c_1, c_2, \ldots, c_p)$ be three 
random permutations of the numbers 1 to p. Also let $\beta$ and $\gamma$ be two random 
numbers, the same or different, but not both 1, between 1 and p. Then the 
lattice co-ordinates of the two sets are 

$$(a_i,\ b_i,\ c_i) \quad\text{and}\quad (a_i,\ b_{i+\beta-1},\ c_{i+\gamma-1}), \qquad i = 1, 2, \ldots, p,$$

with the proviso that when any suffix is greater than p, it is reduced by p. For 
example, if for a $6^3$ lattice the random permutations are (5 6 1 2 4 3), 
(2 4 1 3 6 5), (4 3 6 2 5 1) and $\beta$, $\gamma$ are 6, 4, the two sets are 

(5, 2, 4), (6, 4, 3), (1, 1, 6), (2, 3, 2), (4, 6, 5), (3, 5, 1), 

(5, 5, 2), (6, 2, 5), (1, 4, 1), (2, 1, 4), (4, 3, 3), (3, 6, 6). 
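
The construction can be verified against the numerical example above with a short Python sketch. The suffix rule used, i for the first set and i + β - 1, i + γ - 1 (reduced by p when greater than p) for the second, is the one that reproduces the two sets listed; the function names are introduced here for illustration.

```python
import random

def two_cubic_sets(a, b, c, beta, gamma):
    """Two mutually exclusive sets of p units from a p x p x p lattice.

    a, b, c are permutations of 1..p; beta and gamma lie between 1 and p
    and must not both be 1 (otherwise the two sets would coincide).
    """
    p = len(a)
    if beta == 1 and gamma == 1:
        raise ValueError("beta and gamma must not both be 1")
    first = [(a[i], b[i], c[i]) for i in range(p)]
    # Suffixes greater than p are reduced by p (modular arithmetic).
    second = [(a[i], b[(i + beta - 1) % p], c[(i + gamma - 1) % p])
              for i in range(p)]
    return first, second

def draw_two_sets(p, rng=random):
    """Draw the permutations and the numbers beta, gamma at random."""
    permutations = []
    for _ in range(3):
        permutation = list(range(1, p + 1))
        rng.shuffle(permutation)
        permutations.append(permutation)
    while True:
        beta, gamma = rng.randint(1, p), rng.randint(1, p)
        if (beta, gamma) != (1, 1):
            break
    return two_cubic_sets(*permutations, beta, gamma)

first, second = two_cubic_sets([5, 6, 1, 2, 4, 3], [2, 4, 1, 3, 6, 5],
                               [4, 3, 6, 2, 5, 1], beta=6, gamma=4)
print(first)
print(second)
```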

With a sample of 4p units an estimate similar to that of the square lattice 
sample is possible when p is even, using a basic pattern of $\tfrac{1}{2}p$ cubes of 
size 2 × 2 × 2. Unfortunately, however, the different variance components 
enter in different proportions into this estimate. Investigation shows that the 
unbiased estimate involves a difference of the mean squares corresponding to 
the two-factor and three-factor interactions of the 2 × 2 × 2 cubes. It will 
rarely be more accurate than the estimate based on the three degrees of freedom 
obtained by taking four sets of p units at random from $p^2$ mutually exclusive sets. 

(c) Cubic lattice, sample of $2p^2$ units 

When p is even, an unbiased estimate of error with $\tfrac{1}{4}p^2$ degrees of freedom, 
similar to that for the square lattice, can be obtained. The basic pattern is 
made up of 2 × 2 × 2 cubes. It can be represented by a Latin square made 
up of 2 × 2 squares, and a second Latin square obtained by reversing all the 
2 × 2 squares of the first square. The letters then represent the third 
co-ordinate. Table 10.14.c shows a suitable pair of squares for p = 6. Larger 
squares can be constructed in a similar manner. 

TABLE 10.14.c BASIC PATTERN FOR A SAMPLE OF $2p^2$ UNITS FROM A CUBIC LATTICE 

A B C D E F      B A D C F E 
B A D C F E      A B C D E F 
C D E F A B      D C F E B A 
D C F E B A      C D E F A B 
E F A B C D      F E B A D C 
F E B A D C      E F A B C D 

As before, the rows, columns, and letters must be randomized, the same 
randomization being used for each square. The randomization of letters is effected 
in the same manner as that for rows and columns, writing the letters A-F 
in random order, say B F A C D E, and substituting B for A, F for B, A for 
C, etc. 

For the estimate of error we must calculate for each of the original 2 × 2 × 2 
cubes the difference between the sum of the four units of the first square and 
the sum of the four units of the second square belonging to that cube. These 
differences can be represented geometrically by assigning opposite signs to the 
points at the two ends of each edge of the relevant cube. The components 
of the differences can easily be picked out, since (with the letter randomization 
adopted) B goes with F, A with C, and D with E, and the components of each 
cube occur at the intersection of two rows and two columns which are the same 
for each square. 

There are $\tfrac{1}{4}p^2$ such differences. If these are denoted by d we have for the 
variance of a single unit 

$$s^2 = S(d^2)/2p^2$$

The sampling variance of the mean of the $2p^2$ units is therefore $(1 - 2/p)s^2/2p^2$. 

(d) Estimation of error from complete data 

If we have available data on all units of one or more lattices, we can without 
difficulty estimate what the sampling error of lattice sampling would have 
been. Such information is necessary in the planning of surveys both for 
determination of size of sample and for the study of the relative efficiency of 
lattice sampling and other methods. 

The procedure of the analysis of variance can be applied to the complete 
data of a square lattice in the manner of Table 10.14.d. 

TABLE 10.14.d ANALYSIS OF VARIANCE FOR A COMPLETE SQUARE LATTICE 

                         Degrees of 
                         freedom 

Rows (R)                 $p - 1$ 
Columns (C)              $p - 1$ 
Remainder (R × C)        $(p - 1)^2$ 

Total                    $p^2 - 1$ 

The sum of squares for rows is given by the sum of the squares of the 
deviations of the row totals, divided by p, and similarly for the columns. 
The sum of squares for the remainder (known as the two-factor interaction 
R × C) is obtained by subtraction. The remainder mean square then gives 
an estimate $s^2$ of the variance per unit in lattice sampling. If we require to 
determine the relative precision of lattice sampling and simple stratification by 
rows, we calculate a new mean square for columns plus remainder, adding 
the degrees of freedom and the sums of squares. Similarly, rows plus remainder 
gives an estimate of the variance for stratification by columns, while the total 
line gives the estimate for random sampling. 
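
The analysis of Table 10.14.d and the comparisons derived from it can be sketched in Python. The function name, the dictionary keys and the small 2 × 2 table of data are hypothetical, introduced only for illustration.

```python
def square_lattice_anova(table):
    """Variance per unit under four sampling methods, from the complete
    data of a p x p lattice (table[row][column])."""
    p = len(table)
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / p ** 2
    total_ss = sum(v * v for row in table for v in row) - correction
    # Squared deviations of row (column) totals, divided by p.
    row_ss = sum(sum(row) ** 2 for row in table) / p - correction
    col_ss = sum(sum(row[j] for row in table) ** 2 for j in range(p)) / p - correction
    remainder_ss = total_ss - row_ss - col_ss          # by subtraction
    df_main, df_remainder = p - 1, (p - 1) ** 2
    return {
        "lattice": remainder_ss / df_remainder,
        "stratified_by_rows": (col_ss + remainder_ss) / (df_main + df_remainder),
        "stratified_by_columns": (row_ss + remainder_ss) / (df_main + df_remainder),
        "random": total_ss / (p * p - 1),
    }

mean_squares = square_lattice_anova([[1.0, 2.0], [3.0, 5.0]])
print(mean_squares)
```

The relative precision of lattice sampling and, say, stratification by rows is then the ratio of the two corresponding mean squares.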

The procedure in the case of a $p^3$ lattice is similar. Denote the three 
classifications by R, C and L. Summation over L gives a p × p table of totals 
which can be analysed in the manner of Table 10.14.d to give R, C, and 
R × C. The sums of squares must be divided by an additional factor p because 
we are working with totals of p units. Tables for R and L and C and L can 
be constructed similarly. These can then be combined into a single table 
in the manner of Table 10.14.e. The three-factor interaction R × C × L 
is then obtained by subtraction. 

TABLE 10.14.e ANALYSIS OF VARIANCE FOR A COMPLETE CUBIC LATTICE 

                 Degrees of       Mean 
                 freedom          square 

R                $p - 1$ 
C                $p - 1$ 
L                $p - 1$ 
R × C            $(p - 1)^2$      A 
R × L            $(p - 1)^2$      B 
C × L            $(p - 1)^2$      C 
R × C × L        $(p - 1)^3$      D 

Total            $p^3 - 1$ 

The estimate of error variance for a lattice sample of $p^2$ units (based on a 
Latin square) is given by the mean square D for R × C × L. The estimate 
of error for a lattice sample of p units is given by the expression 

$$s^2 = \{A + B + C + (p - 2)D\}/(p + 1)$$

If p is large this tends to the pooled mean square for all four interactions. 
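
The corresponding computation for a complete cubic lattice can be sketched in Python as follows. The function name and the small set of data are hypothetical; the two-factor sums of squares are obtained from the two-way tables of totals, less the two main effects concerned, and the three-factor interaction by subtraction.

```python
from itertools import product

def cubic_lattice_anova(x):
    """Interaction mean squares A, B, C, D and the two error estimates
    from the complete data x[r][c][l] of a p x p x p lattice."""
    p = len(x)
    idx = range(p)
    cells = list(product(idx, repeat=3))
    correction = sum(x[r][c][l] for r, c, l in cells) ** 2 / p ** 3

    def main_ss(total_of):        # totals of p*p units
        return sum(total_of(i) ** 2 for i in idx) / p ** 2 - correction

    def table_ss(total_of):       # two-way table of totals of p units
        return sum(total_of(i, j) ** 2 for i in idx for j in idx) / p - correction

    ss_r = main_ss(lambda r: sum(x[r][c][l] for c in idx for l in idx))
    ss_c = main_ss(lambda c: sum(x[r][c][l] for r in idx for l in idx))
    ss_l = main_ss(lambda l: sum(x[r][c][l] for r in idx for c in idx))
    ss_rc = table_ss(lambda r, c: sum(x[r][c][l] for l in idx)) - ss_r - ss_c
    ss_rl = table_ss(lambda r, l: sum(x[r][c][l] for c in idx)) - ss_r - ss_l
    ss_cl = table_ss(lambda c, l: sum(x[r][c][l] for r in idx)) - ss_c - ss_l
    ss_total = sum(x[r][c][l] ** 2 for r, c, l in cells) - correction
    ss_rcl = ss_total - ss_r - ss_c - ss_l - ss_rc - ss_rl - ss_cl

    df2, df3 = (p - 1) ** 2, (p - 1) ** 3
    a, b, c_ms, d = ss_rc / df2, ss_rl / df2, ss_cl / df2, ss_rcl / df3
    return {
        "A": a, "B": b, "C": c_ms, "D": d,
        "s2_sample_of_p_squared_units": d,
        "s2_sample_of_p_units": (a + b + c_ms + (p - 2) * d) / (p + 1),
    }

# Hypothetical 2 x 2 x 2 data with an R x C interaction only.
data = [[[float(r * c) for l in range(2)] for c in range(2)] for r in range(2)]
result = cubic_lattice_anova(data)
print(result)
```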

(e) Multi-stage schemes 

When each cell of the lattice contains a number of second-stage units, of 
which some only are selected, the estimation of error follows the ordinary lines 
for multi-stage schemes. If an estimate of the second-stage error is not required 
the selection of one second-stage unit from each first-stage unit will normally 
provide an adequate estimate of the total sampling error, since the first-stage 
sampling fraction is not usually large. 

Another type of two-stage sampling arises when the members of one of the 
lattice classifications are themselves a sample from a larger number of such 
classes. An example of this type of sampling is described in the next section, 
where the method of estimating the error is explained. 

10.15 Censuses of road traffic 

Statistics for road traffic, such as total vehicle-miles, passenger-miles 
and ton-miles, can be obtained either from returns made by vehicle operators 
or from counts and other observations of vehicles passing selected points of the 
road network. Statistics of tons loaded and length of journey can only be simply 
obtained from returns by vehicle operators. The sampling problems of the 
former type of census are straightforward, and need not be further discussed 
here, but those arising in road traffic counts present a number of special features 
which are of general interest. 

In addition to their use in estimating the volume of road traffic, traffic 
counts are of value in road planning. For this purpose counting points can 
best be located at strategic points in the road network, so as to estimate the 
volume of traffic on particular stretches of road. The counts themselves may 
be confined to a particular week chosen as typical, or possibly to two or three 
such weeks at different times of the year. Nor is there any need to cover 
periods of the day for which the traffic is known to be light. Classification into 
types of vehicle will frequently be required, but not information on loads 
carried. Consequently, unless information on the origin and destination is 
required it will not be necessary to stop vehicles. If automatic counting 
devices are installed the whole period covered may be sampled for purposes 
of classification of vehicles. 
For the purpose of estimating the total volume of road traffic a different 
procedure is necessary. Unbiased estimates of quantities such as passenger- 
miles, vehicle-miles and ton-miles, cannot be obtained if the counting points are 
located at strategic points in the network. Instead some method of random 
or systematic location is necessary.* 

If points are located at random on the roads of a network with a density of 
1 per k miles and from counts made at these points it is found that $n_1$, $n_2$, . . . 
vehicles pass the points in a given period, then an unbiased estimate of the 
total vehicle-miles in the period is given by 

vehicle-miles = $k(n_1 + n_2 + \ldots) = k\,S(n)$ 

Similarly an unbiased estimate of the ton-miles is k S(W), where W is the 
total of the loads of all the vehicles passing a given point in the given period. 
It will be necessary to stop at least a sample of the vehicles passing the chosen 
points to ascertain the loads carried. 

In order to locate points at random on the network a running total of the 
lengths of all the roads and sections of road comprising the network may be 
made in such a manner that each piece of road is included once and once only. 
If the total length is L miles, any number l less than L then defines a unique 
point on the road network. If j numbers are selected at random between 1 
and L these numbers will define points on the road network which will be located 
at random, each mile point having an equal chance of being selected. To give 
a density of one point per k miles, we take j = L/k. 

Instead of locating the points at random they may be located systematically, 
i.e. at equal intervals with regard to the running total, using a random starting 
point h between 1 and k and selecting the points h, h + k, h + 2k, . . . 
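
The location of points from a running total, and the estimate k S(n) given earlier, can be sketched in Python. The road sections, lengths and counts are hypothetical, the function names are introduced here, and lengths are taken in whole miles for simplicity.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def random_points(total_length, k, rng=random):
    """j = L/k mile points selected at random between 1 and L."""
    j = total_length // k
    return sorted(rng.randint(1, total_length) for _ in range(j))

def systematic_points(total_length, k, rng=random):
    """Points h, h + k, h + 2k, ... with random start h between 1 and k."""
    h = rng.randint(1, k)
    return list(range(h, total_length + 1, k))

def locate(position, sections):
    """Convert a position on the running total into (section, miles from
    the start of that section). sections is a list of (name, length)."""
    ends = list(accumulate(length for _, length in sections))
    if not 0 < position <= ends[-1]:
        raise ValueError("position lies outside the network")
    i = bisect_left(ends, position)
    return sections[i][0], position - (ends[i] - sections[i][1])

def estimate_vehicle_miles(k, counts):
    """Unbiased estimate k S(n) from the counts at the selected points."""
    return k * sum(counts)

sections = [("A", 10), ("B", 5), ("C", 20)]     # hypothetical network, L = 35
points = systematic_points(35, 5, random.Random(0))
print([locate(position, sections) for position in points])
print(estimate_vehicle_miles(5, [100, 250, 40]))
```

With a variable sampling fraction the same routine would be applied separately to the running total of each stratum, with its own value of k.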

In practice it will be advisable to use a variable sampling fraction, with a 
considerably higher density of points on the more important roads. For this 
purpose the roads must be stratified according to their importance. When this 
stratification has been made, separate running totals can be constructed for 
the different strata. If any large area is to be covered it will also obviously be 
advisable to divide the area into regions and treat each region separately. 

If desired, certain types of road such as roads within city boundaries, and 
minor roads in built-up areas, can be excluded entirely. This will automatically 
exclude the traffic on these roads from the estimates. 

Sampling may be used in various other ways to increase the accuracy of 
the results with a given expenditure of effort. Lattice sampling is particularly 
useful. Instead of carrying out continuous counts, for example, counts may 
be made at different hours of the day at different points, a rotation being arranged 
so that all periods of the day are equally covered, and that different periods are 
covered at the same point on different days. Thus the counts on a group 
of 12 points can be so arranged that on 12 consecutive days each 2-hour period 
is counted on one and one only of the 12 days, and that on each day every 

* The proposals which follow are based on a note prepared for a Working Party 
of the Inland Transport Committee of the U.N. Economic Commission for Europe. 

363 



SECT. 10.16 SAMPLING METHODS FOR CENSUSES AND SURVEYS 

2-hour period is covered at some point of the group. Since there is^iikely to be 
a considerable amount of variation between points, a rotation of this type may 
be expected to increase the accuracy considerably, since a much larger number 
of points can be included for the same amount of counting. 

There will of course be little advantage in using this type of sampling for 
counts if automatic counting devices are available, but the procedure will still 
be of value for sampling to ascertain loads, etc. The number of vehicles that 
have to be stopped may also be reduced by examining only a fraction of the 
vehicles passing during the time a point is under observation. Care must be 
taken to see that bias is not introduced by this procedure. If every third vehicle 
is examined, for example, there must be no element of choice, i.e. the count 
must be based on the order of arrival. If the fraction is varied from time to 
time owing to variation in traffic density each such variation must be noted so 
that the correct raising factor can be used for each part of the data. 
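The use of a separate raising factor for each part of the data may be illustrated by the following Python sketch. The function names and the figures are hypothetical; it is assumed that in each spell of observation every m-th vehicle is examined in strict order of arrival, so that the raising factor for that spell is m.

```python
def examined_vehicles(arrivals, interval, start):
    """Select every `interval`-th vehicle strictly by order of arrival.

    `start` (1..interval) is the position of the first vehicle examined.
    """
    return arrivals[start - 1::interval]

def raised_total(spells):
    """Estimate the number of vehicles passing.

    `spells` holds one pair (number examined, interval in force) for
    each spell during which the sampling fraction was held constant.
    """
    return sum(examined * interval for examined, interval in spells)

# Hypothetical record: 1 in 3 in light traffic, 1 in 5 at the peak, then 1 in 2.
spells = [(40, 3), (25, 5), (60, 2)]
estimate = raised_total(spells)   # 40*3 + 25*5 + 60*2 = 365
```

If the change of fraction were not recorded, a single factor would have to be applied to all 125 vehicles examined, and the estimate would be biased.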

It is essential for objective estimates that the whole of the day and night is 
covered. If the volume of traffic is substantially reduced during the night, 
however, it may be advantageous to use a variable sampling fraction here also, 
covering the night period less intensively than the day period. Similarly if an 
objective estimate of the total annual volume of traffic is required the different 
parts of the year must be properly sampled. 

The calculation of the sampling error is straightforward except for rotational 
schemes. Since the observation points themselves constitute a random selection 
from all possible points, a single Latin square arrangement of observations on 
12 points extending over 12 days with 12 periods in each day forms a second- 
stage sample of one unit out of the 12 units defined by any set of 12 squares 
which together comprise the whole of the traffic passing these points. The 
difference between the totals for a pair of Latin squares from the set will there- 
fore only give an estimate of the sampling error at the second stage. To obtain 
an estimate of the total sampling error, two different sets of 12 points must be 
taken for the two Latin squares. The square for each set should be inde- 
pendently randomized, but the points of each set can be obtained (with some 
gain in precision) by random selection of two points from each of 12 strata 
instead of 24 points from a single stratum. Thus in the simple case of the 
estimation of traffic along a single main road the road can be divided into 12 
equal sections. Two points are then located at random in each section, one 
of each pair of points being allocated at random to the first square, and the other 
to the second square. Each pair of squares yields only one degree of freedom. 
This limitation, however, is not of great importance in schemes which are 
extensive either in space or time, since many pairs of squares will be required 
for the whole scheme. 
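The way in which the single degrees of freedom from many pairs of squares may be pooled can be sketched as follows. This is a sketch under stated assumptions, not a formula taken from the text: it is assumed that the raised totals of the two squares of a pair are independent estimates of the same quantity, that the pairs are independent of one another, and that no finite sampling correction is needed. The function name and the figures are hypothetical.

```python
from math import sqrt

def paired_estimate(pairs):
    """Combine pairs of Latin square totals.

    `pairs` holds (t1, t2), the raised totals from the two squares of
    each pair. Each pair contributes its mean to the estimate and
    (t1 - t2)**2 / 4 to the estimated variance (one degree of freedom).
    """
    total = sum((t1 + t2) / 2 for t1, t2 in pairs)
    variance = sum((t1 - t2) ** 2 / 4 for t1, t2 in pairs)
    return total, sqrt(variance), len(pairs)

total, standard_error, degrees_of_freedom = paired_estimate(
    [(100, 120), (80, 70)])
```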

10.16 The use of sampling to speed up analysis: the 1951 Census of Great Britain

The value of sampling in reducing the volume of numerical and machine 
work and speeding up the analysis of a complete census has already been
stressed in Section 5.21. An interesting example of this use of sampling is
provided by the 1 per cent. sample of the 1951 Census of Great Britain (General
Register Office, 1952, C'). This sample was taken with the object of providing 
preliminary results within a year of the census. 

The general procedure of this census was as follows. Large institutions and 
analogous establishments likely to contain 100 or more persons each were 
first identified and listed by the local census officers. These institutions (termed 
special enumeration districts, S.E.D.s, with one institution to each district)
were enumerated on special institution schedules. The remaining habitations 
of the area were divided into districts termed ordinary enumeration districts 
(O.E.D.s) of such a size that each district could be dealt with by a single 
enumerator. The O.E.D.s were given identification numbers ranging conse- 
cutively from 1 onwards within each local census area. All census officers 
delineated their E.D.s on large-scale maps, and for the purpose of ready reference 
numbered them adjacently as far as possible. The general effect of this is that 
odd and even E.D.s tend to be contiguous. In all there were 49,318 O.E.D.s 
in England and Wales with an average content of about 270 households, or 
860 persons, and 9,730 O.E.D.s in Scotland with an average content of 150 
households, or 510 persons. There was considerable variation in the numbers 
of households and persons in the different O.E.D.s owing to variation in local
conditions. Each habitation of each O.E.D. was listed systematically prior to 
the actual census by complete traverse of the district by the enumerator con- 
cerned, and the census schedules were subsequently numbered in the list order. 
Each schedule covered one household. 

For the sample from the O.E.D.s each local enumerator was instructed to 
furnish copies of the schedules of all households bearing numbers ending in 25 
if his O.E.D. number was odd, and ending in 76 if his O.E.D. number was 
even. For the sample from the S.E.D.s each local census officer was instructed 
to number the individuals of each S.E.D. schedule consecutively from 1 onwards 
and to furnish copies of the entries bearing numbers ending in 25 for odd- 
numbered S.E.D.s and 76 for even-numbered S.E.D.s. 
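The selection rule can be stated compactly in Python. The function name `in_sample` is an assumption introduced for illustration; the rule itself (end digits 25 for odd-numbered districts, 76 for even-numbered districts) is that described above.

```python
def in_sample(district_number, serial_number):
    """True if the schedule (or individual) with this serial number is
    taken into the 1 per cent. sample for the given district."""
    ending = 25 if district_number % 2 == 1 else 76
    return serial_number % 100 == ending

# A district of 270 households: an odd number gives serials 25, 125, 225,
# an even number gives serials 76 and 176.
odd_district = [s for s in range(1, 271) if in_sample(1, s)]
even_district = [s for s in range(1, 271) if in_sample(2, s)]
```

A district of about average size thus contributes two or three households according to its number and its exact size, a point which bears on the discussion of bias later in this section.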

Apart from the disturbance arising from the use of only two pairs of end 
digits this procedure if correctly carried out should yield an almost exact 
1 per cent. sample of households in the O.E.D.s and of individuals in the
S.E.D.s. From the systematic method of selection adopted a high degree of
stratification within O.E.D.s and S.E.D.s is obtained. The sample will not,
however, contain exactly 1 per cent. of the population owing to variation in the size
of the O.E.D. households, which can range from 1 to approximately 100. 

A test of the sample as drawn was made by comparing it with the preliminary 
count of the full census. The result of this comparison for the whole country 
is shown in Table 10.16. The agreement on number of households is satis- 
factory for England and Wales, but there is some excess in the sample for 
Scotland. This disagreement is in fact of no consequence in view of the further 
adjustment (described below) which was made for each area which was to figure 
in the tabulation. The object of this adjustment was to bring not only the
number of households but also the total population into agreement with the
preliminary count. This ensured that the published sample totals would
agree with other published totals which might later be derived from the tabu-
lation of the full census material. The adjustment has, of course, the further
effect of somewhat increasing the accuracy of the sample, though at the expense
of slight distortion in certain respects.

TABLE 10.16 1951 CENSUS OF GREAT BRITAIN : COMPARISON OF THE 1 PER CENT. SAMPLE AND THE PRELIMINARY COUNT

                                       1/100th of      Excess or defect
                                       full census        of sample
                            Sample    (preliminary
                                         count)        Amount   Per cent.

Great Britain
  Total population          488,395      488,411        - 16     - 0.00
  Households in O.E.D.s     146,628      146,539        + 89     + 0.06

England and Wales
  Total population          437,158      437,450        - 292    - 0.07
  Households in O.E.D.s     131,973      131,998        - 25     - 0.02

Scotland
  Total population           51,237       50,961        + 276    + 0.54
  Households in O.E.D.s      14,655       14,541        + 114    + 0.78

It is, however, instructive to examine in a little more detail the probable
causes of the disagreement in numbers of households. If a systematic sample
of every hundredth object with random starting point j between 1 and 100
is taken from a number 100h + k of consecutively numbered objects, where
h and k are integers and 0 ≤ k ≤ 99, the number in the sample will be h or
h + 1 according as j > k or j ≤ k. The mean number over a large number of
samples will be h + k/100. The sampling variance of the number in the sample
is given by the mean square deviation from the mean number, and is found to be

$$\frac{k}{100}\left(1 - \frac{k}{100}\right)$$

If we have a large number of sets containing numbers of objects which differ
in such a manner that the values of k can be taken to be evenly distributed over
the range 0-99 the average sampling variance per set will be

$$\frac{1}{100}\sum_{k=0}^{99}\frac{k}{100}\left(1 - \frac{k}{100}\right)$$

which equals 0.16665 or very nearly 1/6.
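The value 0.16665 can be checked by direct enumeration of all 100 starting points. The following Python sketch does this with exact fractions; the function names are assumptions introduced for illustration, and the value of h chosen is immaterial since it cancels from the variance.

```python
from fractions import Fraction

def count_in_sample(total, j, interval=100):
    """Number of the objects j, j + interval, ... not exceeding `total`."""
    return 0 if j > total else (total - j) // interval + 1

def variance_of_count(h, k, interval=100):
    """Mean square deviation of the sample number over all starts j."""
    total = interval * h + k
    counts = [count_in_sample(total, j, interval)
              for j in range(1, interval + 1)]
    mean = Fraction(sum(counts), interval)
    return sum((c - mean) ** 2 for c in counts) / interval

# Average over k = 0, 1, ..., 99 (h = 3 is an arbitrary choice).
average = sum(variance_of_count(3, k) for k in range(100)) / 100
assert average == Fraction(3333, 20000)   # i.e. 0.16665
```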

Instead of taking each of the j's independently at random we may select
them in complementary pairs subject to the condition that j + j' = 101,
assigning each such pair to a pair of sets. Thus if the first of a pair of sets is
allocated a number 64 chosen at random, the second will be allocated the
number 37. By a calculation similar to that given above we then find that the
average sampling variance per set, when the values of k and k' can be regarded as
independently and evenly distributed over the range 0-99, is equal to 0.083325,
or very nearly 1/12. This is half the variance obtained when each value of j
is chosen independently.* The reason for the reduction is clear if we remember
that a run of low values of j will lead to an overestimate, and a run of high values
to an underestimate. The reduction is due to the balancing of high and low
values. It does not depend on there being any particular similarity between
the pairs of sets. If the values of k are highly correlated within pairs the
variance will be further reduced.
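The halving of the variance can likewise be verified by enumeration. In the following Python sketch (the function name is an assumption) the excess of each set over h is 1 when its starting number does not exceed its k, the second set of a pair receives the starting number 101 − j, and the variance of the combined excess is averaged over all values of k and k'.

```python
from fractions import Fraction

def average_variance_complementary(interval=100):
    """Average sampling variance per set with complementary starts."""
    numerator = 0
    for k1 in range(interval):
        for k2 in range(interval):
            s1 = s2 = 0
            for j in range(1, interval + 1):
                # Excess over h in the first set plus excess over h'
                # in the second set, which uses interval + 1 - j.
                extra = (j <= k1) + (interval + 1 - j <= k2)
                s1 += extra
                s2 += extra * extra
            # interval**2 times the variance of the pair total
            numerator += interval * s2 - s1 * s1
    # Average over interval**2 pairs of (k, k'), then halve to obtain
    # the variance per set.
    return Fraction(numerator, 2 * interval ** 4)

assert average_variance_complementary() == Fraction(3333, 40000)  # 0.083325
```

The enumeration runs through a million cases and so takes a second or so; a closed expression could be substituted, but the direct count follows the argument of the text more closely.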

If the same pair of complementary values is always taken, as in the British 
Census, the distribution of k has to be considered. Analysis shows that with an 
even distribution of k the mean error (or bias) is zero, and the average sampling 
variance is 0.083325 as above. But with any other distribution of k some bias
will be introduced. 

Since there are 49,316 O.E.D.s in England and Wales the standard error
of the number of households in the sample due to sampling of the type adopted
will be √(0.083325 × 49,316) = 64. The similar error for Scotland is
√(0.083325 × 9,730) = 28. The discrepancy for England and Wales is
therefore well within the sampling standard error, but that for Scotland is 4.1
times its standard error.
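The arithmetic is easily reproduced. In the following Python sketch the names are assumptions; the figures are those given above, with the discrepancies of 25 and 114 households taken from Table 10.16.

```python
from math import sqrt

VARIANCE_PER_DISTRICT = 0.083325   # average sampling variance per O.E.D.

def count_standard_error(n_districts):
    """Standard error of the sample number of households."""
    return sqrt(VARIANCE_PER_DISTRICT * n_districts)

se_england_wales = round(count_standard_error(49316))   # 64
se_scotland = round(count_standard_error(9730))         # 28

# Discrepancies in units of the (rounded) standard error.
ratio_england_wales = 25 / se_england_wales
ratio_scotland = round(114 / se_scotland, 1)             # 4.1
```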

As is pointed out in the report, the discrepancy in Scotland can be accounted 
for by the fact that the selection number 25 is always associated with the odd- 
numbered O.E.D.s and the selection number 76 with the even-numbered 
O.E.D.s. As the O.E.D.s were numbered from 1 upwards in each census area, 
the excess of odd-numbered enumeration districts will be approximately one- 
half the number of census areas. There were 1,225 census areas in England 
and Wales and 1,026 in Scotland. The bias through consistently taking the 
selection number 25 will be approximately 1/4 per O.E.D. Therefore the
bias introduced will be 1/4 × excess of odds, i.e. + 153 for England and
Wales and + 128 for Scotland. The discrepancy in the number of house-
holds for Scotland shown in Table 10.16 therefore appears to be fully accounted
for by this bias. In the case of England and Wales, however, allowance for the 
bias will increase the discrepancy to 178, i.e. nearly three times its sampling 
standard error. 
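The size of the bias per O.E.D. can be derived by enumeration, on the assumption that k is evenly distributed over the range 0-99. The following Python sketch (function names are assumptions) gives + 0.255 for the number 25 and − 0.255 for the number 76, which the text rounds to one-quarter; the two cancel except for the excess of odd-numbered districts.

```python
from fractions import Fraction

def bias_per_district(j, interval=100):
    """Mean excess of the sample number over its expectation k/interval
    when the starting number j is fixed and k is uniform on 0..interval-1."""
    selected = Fraction(sum(1 for k in range(interval) if j <= k), interval)
    expected = Fraction(sum(range(interval)), interval ** 2)
    return selected - expected

def total_bias(n_census_areas, per_district=Fraction(1, 4)):
    """Bias from the excess of odd-numbered districts, taken as one-half
    the number of census areas."""
    return Fraction(n_census_areas, 2) * per_district

assert bias_per_district(25) == Fraction(51, 200)    # + 0.255
assert bias_per_district(76) == -Fraction(51, 200)   # - 0.255
bias_england_wales = round(total_bias(1225))          # 153
bias_scotland = round(total_bias(1026))               # 128
```

For England and Wales the observed defect of 25 households, taken with an expected excess of 153, gives the discrepancy of 178 quoted above.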

This secondary discrepancy may have arisen from the form of the distri- 
bution of k. The bias to be expected from this cause, however, can only be 
calculated from the distribution of the numbers of households in the different 
O.E.D.s.

From the published description it would appear that there would in fact 
have been little difficulty in using all pairs of complementary numbers instead 
of the pair 25 and 76. This would have eliminated the biases discussed above, 
and would have been essential if a preliminary count had not been available. 

* I am indebted to the General Register Office for drawing my attention to the
reduction in variance resulting from the use of complementary numbers.

A suitable method would have been to write down all pairs of complementary
numbers 1-100 in random order, and also randomize the order within the pairs. 
The numbers of the sequence could then be allotted in turn to the O.E.D.s 
in whatever systematic order they presented themselves. In order to avoid the 
need for fresh randomization the same sequence could be used repeatedly. 
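The method suggested may be sketched in Python as follows. The function names are assumptions; the sketch forms the 50 complementary pairs, randomizes the order of the pairs and the order within each pair, and allots the resulting sequence of 100 numbers to the districts in turn, repeating it as often as necessary.

```python
import random
from itertools import cycle

def complementary_sequence(interval=100, seed=None):
    """All numbers 1..interval, arranged as randomly ordered
    complementary pairs (j, interval + 1 - j)."""
    rng = random.Random(seed)
    pairs = [[j, interval + 1 - j] for j in range(1, interval // 2 + 1)]
    rng.shuffle(pairs)            # random order of the pairs
    for pair in pairs:
        rng.shuffle(pair)         # random order within each pair
    return [j for pair in pairs for j in pair]

def allot(districts, sequence):
    """Allot the numbers of the sequence in turn, re-using the sequence."""
    return dict(zip(districts, cycle(sequence)))

sequence = complementary_sequence(seed=1951)
selection_numbers = allot(range(1, 251), sequence)
```

Consecutive districts receive complementary numbers, so that the balancing of high and low values is retained, while every number from 1 to 100 is used equally often in each complete run of the sequence.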

Apart from its effect on the numbers of households selected the use of only 
one complementary pair introduces the danger of a bias of a different type. 
If the numbering of the households in the O.E.D.s is such that the households 
of a certain type tend to be allotted numbers less than 25 this type of household 
will be under-represented in the sample. The method of numbering households 
in the British Census was such that little, if any, bias is likely to have arisen 
from this cause, but in other censuses where the situation is different serious 
bias might well be introduced. 

As already mentioned, an adjustment was made to attain simultaneous 
agreement with the preliminary count for numbers of households and numbers 
of individuals. In order to preserve the households as entities in the sample, 
complete households were added or removed. For each tabulation area the 
sizes of the households to be selected for addition or removal were specified 
so as to attain agreement also in the numbers of individuals. In some cases 
where there was an excess of population and deficit of households, or vice versa, 
it was necessary both to add and to remove households. Selection of the house- 
holds to be added or removed was accomplished by dividing the full or sample 
records of the area into as many sections as the number of households involved 
and selecting a household of one of the prescribed sizes from each section in 
accordance with a mechanical selection procedure. The adjustment process 
involved the addition of 1,266 new households and the removal of 1,255 existing 
households, both of which are less than 1 per cent. of the original sample total
of 146,628 households. 

In a few cases this procedure broke down. In the case of Chelsea 
Metropolitan Borough, for example, the sample population of 603 was in 
excess by 94, while the number of households, 187, was in excess by 1 only. 
This was found to be due to a hotel with a population of 100 being included in 
the sample. In order to secure agreement on the population this was removed 
and another household of 6 persons substituted. 

This procedure, which is the only possible one if agreement is to be obtained 
for small areas, must in fact result in the elimination from the sample of a number 
of the exceptional "households," such as hotels not sufficiently large to be
classified as S.E.D.s. The households of this type will therefore be under- 
represented in the sample. Even if these households had been retained the 
information on them would be unreliable when classified by small areas owing 
to sampling variation. A measure of the degree of under-representation can be 
obtained by tabulating the households that were removed and comparing it 
with the corresponding tabulation of the households that were inserted. 

The undertaking has been completely successful in its objective of saving 
time in the tabulation and presentation of the results. A target date of one year
from the census day was provisionally set by which the analyses should be
completed and the classified results made available in tabulated form. In the 
event the publication of the more important analyses was achieved at a date 
only a few weeks outside the target period and the remainder followed shortly 
thereafter. Previously census results of a similar character have required from 
three to four years for their publication.

CHAPTER 11
ELECTRONIC COMPUTERS

11.1 Scope of the chapter 

The introduction of electronic digital computers has opened up entirely
new possibilities in the numerical analysis of survey results. Since such
computers are still being very rapidly developed and improved (the first machines
only became operational in 1948) no attempt will be made to describe particular
machines. The present chapter only aims to give an outline of the main features 
of such computers, and to indicate the ways in which they are likely to be of 
use in the analysis of surveys. 

11.2 General description of electronic computers

A general-purpose electronic computer, as its name implies, is a machine 
which is capable of performing a wide variety of computations. The essential 
features of such a computer are 

(1) Very high speed: the fastest machines can carry out many thousands
of arithmetical operations a second. 

(2) Control of the arithmetical and other operations by instructions held in 
the machine in the form of numbers. Such instructions are termed 
orders, and can be read as required by the control mechanism. 

(3) Ability to perform tests, the outcome of which determines the further 
operations of the machine. 

(4) A large store for holding numbers, which can represent orders, data, 
and intermediate and final results.* The locations in the store are
numbered for reference, and these numbers are known as addresses. 

(5) Methods of communicating with the outside world, so that orders and 
data can be fed into the machine and the results of the computations 
made available. 

The operations that can be called for by single orders vary somewhat from 
machine to machine, but are always very simple. There are orders for the basic 
arithmetical operations, addition, subtraction, multiplication, division, and 
shifting†; orders enabling numbers to be transferred between the store and the
working registers; certain simple test orders, such as that the contents of a 
working register are zero, or negative; certain " logical " orders, the most 
important being collation, which enables any required digits of a number to be
picked out; and orders for operating the card or tape reading and punching 
devices, etc. 

* Since orders as well as numbers are represented, the generic term word is used in 
computer terminology. 

† Basically the operation of shifting consists of moving all the digits of a number
left or right a given number of places. A binary arithmetical shift of n places left or
right is equivalent to multiplication by 2^n or 2^(-n).

Since orders are stored in the machine in the form of numbers, they are
usually written in some numerical code. The list of orders for any particular 
machine, together with the associated numerical code, is known as the order 
code of that machine. 

Machines work in either the decimal or the binary scale. In the latter each 
digit of a number has the value 0 or 1 only, and the successive digits represent
powers of 2 instead of 10. Thus with a 5-digit number 00001 = 1, 00010 = 2, 
00011 = 3, 00100 = 4, etc.* 
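The binary scale and the effect of a binary shift can be illustrated in a few lines of Python. The function names are assumptions introduced for illustration; `format`, `int` and the shift operators are standard features of the language.

```python
def to_binary(n, digits=5):
    """Binary representation of a non-negative integer, zero-filled."""
    return format(n, "0{}b".format(digits))

def from_binary(text):
    """Value of a string of binary digits."""
    return int(text, 2)

assert [to_binary(n) for n in (1, 2, 3, 4)] == [
    "00001", "00010", "00011", "00100"]
assert from_binary("00100") == 4

# A shift of n places left multiplies by 2**n; a shift right divides.
assert 3 << 2 == 3 * 2 ** 2
assert 12 >> 2 == 12 // 2 ** 2
```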

If the binary scale is used the machine computes the binary equivalents of 
decimal numbers as they are read in, and the decimal equivalents of binary 
numbers as they are read out, by means of appropriate sets of orders. 

Many of the larger machines use what is known as " floating point " arith- 
metic. In such arithmetic each number carries with it an auxiliary index number 
which indicates the position of the decimal or binary point. Thus 35700 = 
0.357 × 10^5, and 0.00125 = 0.125 × 10^-2; these numbers can therefore be
written as (+5, +0.357) and (-2, 0.125). Their product, to three significant
figures, can therefore be written as (+2, 0.446). The two parts of a number
are stored together in one store location, and in arithmetical operations the 
values of the indices are taken into account and the answer is furnished in 
floating point form. The use of floating point arithmetic saves having to give 
thought to the positioning of the working decimal points when evaluating 
mathematical expressions involving multiplication and division, and avoids the 
need for shifts to preserve accuracy. It therefore greatly simplifies the program- 
ming of such operations, particularly when numbers of very differing magnitudes 
have to be handled by a single programme. 

In machines which have no built-in floating point facilities, floating point 
arithmetic can be used when required by means of an appropriate set of orders, 
though with some loss of speed. 



11.3 Types of store 

Most of the early computers had relatively small stores, but the modern 
trend is towards the provision of much larger stores. This enables more 
elaborate programmes to be used, and permits jobs which require the storage 
of large volumes of data and intermediate results to be handled. With a really
large store records can be held permanently in the machine and referred to and 
brought up to date as required. 

Storage devices vary greatly in their speed of access. The ideal would of 
course be to have a store in which every location could be referred to without 
appreciable delay. Such storage, known as immediate access, is however 
expensive, and even with the introduction of magnetic core stores, provision is 
not usually made for more than 1000-8000 words of immediate access storage. 

* A binary digit is frequently called a bit. Thus a computer may have a word-length 
of 32 bits.

For large-scale data processing work considerably greater storage capacity
is required. This is provided by what are known as " backing-up " stores. The 
two chief types in use are the magnetic drum and magnetic tape. In magnetic 
drums the information is recorded by means of magnetised spots on a series of 
tracks on the surface of a revolving drum. In magnetic tape the information is 
similarly recorded on a reel of tape, with the feeding and rewinding controlled 
by machine orders.

A reel of magnetic tape can hold several million words, and since several 
magnetic tape units can be coupled to the same machine and reels of tape can 
be interchanged the total storage capacity is virtually unlimited. Time of access 
may, however, be considerable, since much feeding or rewinding may be needed 
to gain access to the required part of the tape. 

Magnetic drums usually have a capacity of the order of 2,000-50,000 words. 
Time of access is governed by the number of words to a track and the switching 
arrangements between tracks. Access is of course much speedier than with 
tape, and in some small machines a drum is used to provide practically the 
whole of the storage. In such machines it is usual for each order to contain, as 
part of the order, the address of the next order. By this means the orders can 
be so arranged on the drum that the delay in waiting for the next order is 
minimised (optimal programming). In machines with a sufficiently large 
immediate or fast access store, orders are held in this store while being obeyed, 
and are obeyed in sequence unless a test order (conditional jump) or an 
unconditional jump order directs the machine to proceed to some other order; 
the address of this order is specified in the jump order. 

Arrangements have to be made for the rapid transfer of information between 
the backing-up stores and the immediate or fast access store (the working store). 
Such transfers are usually by blocks of locations. A long programme, for 
example, may have to be stored on the drum; the various parts of it will then 
be brought up into the working store as required. The same part can be brought 
up repeatedly, if necessary, without having to be written back, since reading 
from a store does not destroy the numbers which are read. Similarly data and 
intermediate results can be transferred to and from the backing-up store. All 
these operations are controlled by means of orders in the programme itself. 

11.4 Input and output devices 

Data and programmes are commonly read into electronic computers either 
by means of punched cards of standard type or by means of punched paper tape 
similar to or identical with teleprinter tape. 

Cards are commonly read row by row, and in the larger machines all 80 
columns are read simultaneously. Thus the whole of the data recorded on a 
card can be read in at one card passage. As each row is read the data recorded 
on it are processed by means of a suitable set of input orders, the necessary 
computations being carried out in the intervals between the successive rows. 
When all the rows have been read, and the information has been interpreted
and assembled in the required manner, it can be written away in the store or
used immediately as a basis for computation. The actual operations carried
out between each card will be controlled by the programme for the job in
hand.

Since the interpreting and marshalling of the data are controlled by the 
input orders the data can be punched in any desired form. Commonly the 
standard alpha-numeric code will be used, but other forms of double punching 
may be used if required, and in binary machines the binary code can also be 
used; in the last case each row of the card will represent an 80-digit binary 
number. 

Punched paper tape is read into the machine row by row and processed and 
assembled as it is read in by a suitable set of input orders. Until recently 
standard 5-hole teleprinter tape was commonly used on computers. Six- or 
seven-hole tape has, however, advantages for computer work, and is now being 
adopted in some computers. 

One of the troubles with punched tape is that there has as yet been no 
complete standardisation of code. Consequently the teleprinter equipment used 
for punching and printing is coded in different ways for different types of 
machine, and a tape prepared for one type of machine cannot be read by another 
type unless a special set of input orders is used. 

Cards or punched paper tape are also commonly used for output of results, 
the card or tape punch being operated by the computer by means of a suitable 
set of output orders. The results have then to be printed from the cards or the 
tape, using a tabulator or teleprinter. Recently, however, line printers have been
developed which are directly controlled by the computer, and print a line at a
time.

Rapid development is taking place in the input and output devices, both in 
the matter of speed and in the construction of entirely new devices. Thus, for
example, typewriters, accounting machines and automatic recording devices 
can be fitted to produce punched paper tape in addition to a visual record, 
and equipment is now available which will read typewritten and printed 
characters. 

11.5 Principles of programming

The set of orders required to carry out a specified computation is known 
as the programme or routine for that computation. Since computers operate at 
very high speed it is essential that programmes are so written that all the 
operations required for a computation, including the reading-in of the data and 
the printing of the results, are carried out in one sequence. If this is done then 
once the programme for a particular computation has been prepared, and is 
known to be correct, all that is necessary when the computation is required is 
to read in the programme and the data. The machine will then perform the 
computation and print out the results with little or no intervention by the 
operator.

As we have seen, the operations that can be carried out by single orders are
of a very elementary type. Consequently the programming of even a simple 
calculation requires a number of orders. Thus, the calculation 

$$d = a \times b - c$$

where a, b, c, d refer to store locations or the numbers contained in them, may 
require the following orders : 

(1) Place a in the multiplier register. 

(2) Multiply by b. 

(3) Subtract c. 

(4) Place the result in d.

An essential part of each of these orders is the address of the store location, 
a, b, c or d, to which reference is required. 
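The four orders can be imitated on a toy machine written in Python. The mnemonics LOAD, MUL, SUB and STORE, the single working register and the addresses 10-13 are all hypothetical; they do not correspond to the order code of any particular computer.

```python
def run(orders, store):
    """Obey a list of (operation, address) orders in sequence."""
    register = 0
    for operation, address in orders:
        if operation == "LOAD":       # place number in the register
            register = store[address]
        elif operation == "MUL":      # multiply register by store location
            register *= store[address]
        elif operation == "SUB":      # subtract store location
            register -= store[address]
        elif operation == "STORE":    # place result in store location
            store[address] = register
        else:
            raise ValueError("unknown order: " + operation)
    return store

# a, b, c, d held in locations 10, 11, 12, 13 respectively.
store = {10: 6, 11: 7, 12: 2, 13: 0}
programme = [("LOAD", 10), ("MUL", 11), ("SUB", 12), ("STORE", 13)]
run(programme, store)
assert store[13] == 6 * 7 - 2
```

Each order consists of an operation and the address to which it refers, and the programme itself is merely a list of such pairs held alongside the data.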

    From previous part of computation
                  |
                  v
          Set count r to n
                  |
                  v
    +--> Form a_r × b_r − c_r
    |    Place result in d_r
    |             |
    |             v
    |    Subtract 1 from r
    |    and test r
    |             |
    +-- r not zero
                  | r zero
                  v
    To next part of computation

FIG. 11.5 ELEMENTARY EXAMPLE OF A FLOW DIAGRAM

In order to economise orders, and for other reasons, it is essential, when the 
same operation has to be repeated a number of times, to write the programme 
so that the same set of orders serves for all repetitions. 

Suppose, for example, we wish to perform the above calculation on a series
of n values of a, b and c; that the values of a, b and c are located in consecutive
series of store locations, 500-, 600-, 700-; and that the results are to be located
in 800-. The same set of orders will clearly suffice for all the computations,
provided the addresses contained in them are suitably modified at each repetition.
This modification is effected by what is known as address modification (or
B modification) of orders: this can be done automatically on most machines.
For the calculation of a particular value d_r and the location of the result in the
required address, the four orders given above will have to have reference
addresses 499 + r, 599 + r, 699 + r, 799 + r. To effect this r is held in what
is known as a "modifier register," B say, and the four orders each contain (as
part of the order) the instruction "add the contents of B to the reference
address of the order before obeying it." To obtain all the required values r
will have to assume the values n, n − 1, ..., 1 in turn.

If a single set of orders is used for a repetitive calculation the computer must 
have some means of telling when the required number of operations has been 
performed. This is done by keeping a count, and testing this count after each 
operation to see if the required number of operations has been performed. 
If the test is not satisfied the machine is instructed to return to the first order; 
if it is satisfied it proceeds to the next part of the computation. In the above 
example r can serve as the count, counting down from n to 1. A closed circuit 
of orders of this kind is known as a " loop." 
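The loop, with the count r held in the modifier register and added to the reference addresses 499, 599, 699 and 799, can be imitated in Python as follows. The sketch is illustrative only: the store is represented by a dictionary, the function name is an assumption, and n is assumed to be at least 1.

```python
def run_loop(store, n):
    """Form d_r = a_r x b_r - c_r for r = n, n - 1, ..., 1."""
    b_register = n                           # set count r to n
    while True:
        register = store[499 + b_register]   # place a_r in the register
        register *= store[599 + b_register]  # multiply by b_r
        register -= store[699 + b_register]  # subtract c_r
        store[799 + b_register] = register   # place the result in d_r
        b_register -= 1                      # subtract 1 from r
        if b_register == 0:                  # test r
            break                            # r zero: leave the loop
    return store

store = {}
for r, (a, b, c) in enumerate([(2, 5, 1), (3, 6, 1), (4, 7, 1)], start=1):
    store[499 + r], store[599 + r], store[699 + r] = a, b, c
run_loop(store, 3)
assert [store[800], store[801], store[802]] == [9, 17, 27]
```

The same four orders serve for every repetition; only the contents of the modifier register change from one circuit of the loop to the next.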

The scheme for the above operations can be set out in a diagrammatic form, 
as in Fig. 11.5. Such diagrams are known as flow diagrams. In writing an 
elaborate programme it is best first to construct a flow diagram or set of 
diagrams which exactly define the various operations that have to be performed 
and their inter-relations. The required orders can then be written down. 

The experienced programmer will of course use a much more concise and 
sophisticated notation than that adopted in Fig. 11.5. The principles, however, 
remain the same. 



11.6 Sub-routines 

Programming is much facilitated by having standard sets of orders for 
frequently required operations. These are known as sub-routines. Thus the 
common mathematical operations such as taking a square root will be available 
in sub-routine form. If one or more square roots are required in a computation 
the square root sub-routine is incorporated in the programme. Similarly the 
reading-in of orders and numbers, and the printing of results, are performed 
by input and output sub-routines. The ordinary user of the machine con- 
sequently does not have to trouble himself with the programming of such 
operations. 

Sub-routines are written so that they can be entered from the main 
programme whenever required, with return to the main programme when the 
sub-routine calculation has been completed. 

The sub-routine technique is also a great help in the construction of 
complicated programmes. If in a programme certain operations are repeatedly 
required in different contexts the programmer can start by constructing sub- 
routines for these operations, which are then available to him whenever needed. 
Sub-routines therefore in effect supplement the order code of the machine, 
which can thus be extended to include complex operations of any required type.

11.7 Autocodes and interpretative routines

What are known as autocode routines have now been constructed for many 
machines. These enable the instructions to be written in an abbreviated and
simplified code which is then read into the machine and translated by means 
of the autocode routine into ordinary machine orders. Two alternatives are 
possible. The autocode routine can interpret the autocode orders as they are 
required, in which case the computation proceeds in parallel with the translation, 
or it can construct a complete programme in the order code of the machine. 
In the latter case provision can be made for the programme to be punched out, 
so as to be permanently available for future use; alternatively it can be utilized 
directly while in the store of the machine, being reconstructed afresh by the 
autocode each time it is required. 

In addition to general autocodes, interpretative routines are frequently 
written to deal with particular types of analysis. These enable the instructions
for the analysis required (provided it is of the type dealt with by the routine)
to be very simply and concisely written. The general programme for the analysis
of survey results, outlined later in this chapter, is an example of this type of 
routine. 

11.8 Survey computations: counts, totals, etc. 

Most of the basic summarisation of survey data consists of counts of the
numbers of units occurring in various classes, and the formation of the corres- 
ponding totals of the associated quantitative items of information. As a simple 
example of the way in which such operations may be performed on an electronic 
computer we may consider the formation of the summary tables of yields of 
potatoes given in Section 5.23. 

For any one field there are three relevant items of information, namely the
region r (1-5), variety v (1-5), and yield y.

TABLE 11.8 ALLOCATION OF ADDRESSES IN A TWO-WAY TABLE

                           Region (r)
Variety (v)    Total     1      2      3      4      5
Total           700     701    702    703    704    705
1               706     707    708    709    710    711
2               712     713    714    715    716    717
3               718     719    720    721    722    723
4               724     725    726    727    728    729
5               730     731    732    733    734    735

To form and store a 5 × 5 table and its associated marginal totals 36 store
locations will be required. These must be identified by addresses; we may,
for example, choose the locations 700-735 for the table of numbers of fields.
A further set of locations, 800-835, say, will be required for the table of yield
totals.

A convenient scheme for allocation of the addresses to the cells of the first 
table is given in Table 11.8. 

With this scheme we have the following system of addresses: 

    Address of cell (r, v)         = 700 + r + 6v
    Address of regional total r    = 700 + r
    Address of variety total v     = 700 + 6v
    Address of grand total         = 700

This may be generalized for tables with any number of classifications. For a
P × Q × R table with first address A, for example, the address of cell pqr will
be

    A + p + (P + 1)q + (P + 1)(Q + 1)r

The addresses of the various marginal totals are obtained by putting one or
more of p, q, r equal to zero in this expression. Thus the two-way marginal
table for the (p, r) totals has the addresses

    A + p + (P + 1)(Q + 1)r

The operations required in the above example are then as follows:

(1) Set all 72 store locations of the tables of numbers and totals to zero 
(i.e. store locations 700-735 and 800-835). 

(2) Read in the values of r, v and y from the card for the first field. 

(3) Calculate

        a_1 = r + 6v,  a_2 = r,  a_3 = 6v,  a_4 = 0.

(4) Add 1 to the contents of the four store locations with addresses 700 + a_i.*

(5) Add y to the contents of the four store locations with addresses 800 + a_i.

Operations (2)-(5) are repeated for each card, until the cards for all fields
have been read in. On completion, addresses 700-735 and 800-835 will contain
the tables of numbers of fields and totals of yields respectively.

If the machine is now instructed to take the number in each of the addresses 
800-835 in turn, divide it by the number in the corresponding address 700-735, 
and place the result in the corresponding address 900-935, these last addresses 
will contain the table of mean yields. 
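
The procedure just described may be sketched in a modern language. The following is an illustration only: a Python list stands in for the store, the card values are invented, and the treatment of empty cells (left at zero rather than divided) is an assumption, since the text does not say what the machine should do when a count is zero.

```python
# Sketch of operations (1)-(5), with a Python list standing in for the store.
# The card values and the handling of empty cells are illustrative assumptions.

store = [0.0] * 1000   # hypothetical store; addresses 700-735, 800-835, 900-935 are used


def tabulate(cards, counts_base=700, totals_base=800):
    """Accumulate counts and yield totals, margins included, card by card."""
    for a in range(36):                          # (1) clear both tables
        store[counts_base + a] = 0
        store[totals_base + a] = 0.0
    for r, v, y in cards:                        # (2) read r, v, y from one card
        # (3) offsets of the cell, regional total, variety total, grand total
        offsets = (r + 6 * v, r, 6 * v, 0)
        for a in offsets:
            store[counts_base + a] += 1          # (4) add 1 to the count
            store[totals_base + a] += y          # (5) add y to the total


def mean_table(counts_base=700, totals_base=800, means_base=900):
    """Divide each total by the matching count; empty cells are left at zero."""
    for a in range(36):
        n = store[counts_base + a]
        store[means_base + a] = store[totals_base + a] / n if n else 0.0


# Invented cards: (region, variety, yield)
cards = [(1, 1, 8.0), (1, 1, 10.0), (2, 1, 6.0), (5, 3, 12.0)]
tabulate(cards)
mean_table()
```

Each card touches exactly four locations in each table, so the marginal totals are kept up to date as the cards are read.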

From the above it will be seen that any store location can be used to
accumulate a total or a count, in the same manner as a counter is used on a
punched-card tabulator. In consequence the severe limitations that are imposed
on punched-card tabulations by the very small number of counters available
are entirely overcome in electronic computers. Furthermore, since the whole
of each table that has been formed is in the store of the machine at the end of
the tabulation, further derived tables can be computed and the results can be
printed out in any desired layout by a suitable sequence of printing instructions.
To reproduce the layout of Table 11.8, for example, it is only necessary to
instruct the machine to print out each set of 36 numbers in rows of 6. More
elaborate layouts, including printed headings, may be produced if required.
This facility for printing the results in the desired form obviates much
assembling and copying of results, and eliminates the errors associated with
such work.

* An addition of this kind is not made directly to the number while it is in the store,
though it may be effected by a single order. Instead the number is read out, the addition
performed, and the result written back, the old number being automatically erased by
the writing operation.

The above method of computation is, of course, not the only one that can 
be followed. One obvious modification is to omit the formation of the marginal 
totals as the cards are read in; these will then have to be computed from the 
cell totals when the summation is completed. This requires more elaborate
programming, but if a sub-routine for the computation of the marginal totals
of multiway tables is available it presents no difficulty. The choice of method
will usually be governed by ease of programming and speed of operation. When 
programming a small ad hoc piece of work the above method may well be 
adopted. In a general programme for the analysis of large-scale surveys the 
formation of the marginal totals separately at the end of the tabulation is 
preferable. 
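
A sub-routine of the kind mentioned, which forms the marginal totals of a multiway table after the cell totals are complete, might look as follows. This is a sketch under stated assumptions: the function names `address` and `fill_margins` are hypothetical, the store is again a Python list, and the addressing follows the scheme of the last section, in which an index of zero in any classification denotes a total over that classification.

```python
from itertools import product


def address(first, sizes, index):
    """Address of a cell; an index of 0 in any position denotes a margin."""
    addr, stride = first, 1
    for size, i in zip(sizes, index):
        addr += stride * i
        stride *= size + 1      # each classification has `size` classes plus a total
    return addr


def fill_margins(store, first, sizes):
    """Form every marginal total by summing the body cells (all indices >= 1)."""
    # Clear the margin locations first.
    for index in product(*(range(s + 1) for s in sizes)):
        if 0 in index:
            store[address(first, sizes, index)] = 0
    # Add each body cell into every margin that contains it.
    for index in product(*(range(1, s + 1) for s in sizes)):
        value = store[address(first, sizes, index)]
        for keep in product((True, False), repeat=len(sizes)):
            if all(keep):
                continue        # this pattern is the cell itself
            margin = tuple(i if k else 0 for i, k in zip(index, keep))
            store[address(first, sizes, margin)] += value


# Usage with the 5 x 5 table of the last section (index order: region, variety).
store = [0] * 1000
store[address(700, (5, 5), (1, 1))] = 2
store[address(700, (5, 5), (2, 1))] = 1
store[address(700, (5, 5), (5, 3))] = 1
fill_margins(store, 700, (5, 5))
```

For a table with k classifications each body cell is added into 2^k - 1 margins, so the work is proportional to the size of the table and not to the number of cards.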

11.9 Organization of a general programme for the construction of summary tables

We may now consider how the procedure outlined in the last section may 
be generalized. Such generalization is important, for if a special programme has
to be written for each survey that requires analysis, much of the advantage of 
electronic computers will be lost, since the construction of programmes is a 
lengthy task and requires a great deal of involved and detailed work. 

Inherent in any basic tabulation is the unit to which the tabulation relates. 
In what follows we will assume that the values of the relevant variates are read 
in for each unit in turn, as in the example in the last section. 

Variates read from the cards may be of two kinds:

(a) qualitative, 

(b) quantitative. 

No restrictions (other than those necessitated by the design of the machine) 
need be placed on the quantitative variates, but it will be assumed that all the 
qualitative variates have consecutive integral values from 1 upwards, those not 
satisfying this condition being recoded as they are read in. 

In any particular tabulation the information provided by one or more of 
the original variates, qualitative or quantitative, will be used to determine the 
classes into which the material is to be segregated. If more than one variate is 
used for classification the resultant table will be multiway. In quantitative 
tabulations the variate tabulated may be one of the original quantitative variates 
or a function of a number of them. In frequency-distributions it is the unit 
count (the card count of tabulating parlance) that is tabulated.

The classes of any main classification may correspond directly with the
values of some qualitative variate, or they may consist of groupings of a quanti- 
tative variate, or of some function of quantitative variates. We may also desire 
to group certain values of a qualitative variate, or certain combinations of 
qualitative variates, into a single class. More complicated systems of classification 
can also be envisaged. 

The first requirement of a general programme, therefore, is that it shall be 
capable of forming derived variates from the original variates. The values of 
the derived variates appertaining to the unit that is being dealt with will 
then be stored temporarily alongside the values of the original variates, 
with no further distinction between the original and derived variates. 
Both the original and the derived variates can then be tabulated or used for 
classification. 

The required specifications of the derived variates must be supplied to the 
machine. This can be done by writing the necessary orders for their formation. 
Alternatively, and more conveniently, their specification can be presented in 
some easily written symbolic form which is read into the machine and 
translated by it into the appropriate orders. If this course is followed 
the general programme will have to incorporate a routine which performs the 
translation. 

The main types of specification that are required are:

(1) Simple mathematical functions of quantitative variates. 

(2) Groupings of a quantitative variate. 

(3) Groupings of a qualitative variate. 

(4) Groupings corresponding to a pair of qualitative variates.

Specifications of type 1 will produce another quantitative variate, while
types 2, 3 and 4 will produce further qualitative variates, which as before will
be taken to have integral values from 1 upwards. It should be noticed that the
variates produced by specifications of this kind can be used as the basis of
further specifications. Thus if in a farm survey it is desired to classify farms
by rent per acre, and the recorded information consists of total rent, x_1, and
total acreage, x_2, the rent per acre is specified as

    x_3 = x_1/x_2

and operation 2 is then performed on x_3 to form a new variate x_4. x_4 will then
determine the rent-per-acre class.

Many routines which translate simple mathematical formulae into the 
appropriate orders have now been constructed for electronic computers. We 
may therefore assume that specifications of type 1 can be written in ordinary 
mathematical notation. In addition to the ordinary mathematical functions,
"logical" functions of the "and" and "or" type will be useful; e.g. (a) if
x_1, x_2, x_3, ... are all not 0, then x_s = 1, otherwise x_s = 0; (b) if any one of
x_1, x_2, x_3, ... is not 0, then x_s = 1, otherwise x_s = 0. Such functions generate
qualitative variates. They could be expressed as combinations of specifications 
of types 2, 3 and 4.

For specifications of type 2 the following forms may be adopted:

    x_4 = G x_3 (0+, 10+, 20+, 40+)

    x_4 = G x_3 (0 (×10) 60+)

    x_4 = G x_3 (0 (×10) 20 (×20) 60+)

In the first the class limits are specified, in the second and third the grouping
intervals. In all cases the specification must be such that the number of classes
in the grouping given by the derived variate is determined.
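
The effect of a type 2 specification can be illustrated by a short sketch. The representation is our own assumption, not part of the notation above: each interval form is written as a list of (start, step, stop) segments, a limit such as "10+" is taken to mean "10 and over", and the function names are hypothetical.

```python
from bisect import bisect_right


def limits_from_intervals(segments):
    """Expand (start, step, stop) segments into explicit lower class limits.

    [(0, 10, 20), (20, 20, 60)] stands for the form 0 (x10) 20 (x20) 60+.
    """
    limits = []
    for start, step, stop in segments:
        value = start
        while value < stop:
            limits.append(value)
            value += step
    limits.append(segments[-1][2])   # lower limit of the final open-ended class
    return limits


def group(value, limits):
    """Class number (from 1 upwards) of a value, given ascending lower limits."""
    if value < limits[0]:
        raise ValueError("value lies below the first class limit")
    return bisect_right(limits, value)


# Rent per acre: x_3 = x_1 / x_2, then grouping with limits 0+, 10+, 20+, 40+.
x_1, x_2 = 1500.0, 60.0          # invented total rent and total acreage
x_3 = x_1 / x_2
x_4 = group(x_3, [0, 10, 20, 40])
```

Because the limits are fully expanded before any card is read, the number of classes of the derived variate is known in advance, as the text requires.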

Specifications of type 3 may be of the form

    x_{20} = G x_7 (1, 2, 4) (3, 5) (6, 7, 8)

This causes the classification represented by x_7, consisting of eight classes
numbered 1-8, to be grouped into three classes. This specification can be used
also for changing the order in a classification, e.g.

    x_{21} = G x_{10} (1) (3) (2) (4)

A similar form may be adopted for specifications of type 4, e.g.

    x_{22} = G x_8/x_9 (1/1) (1/2) (1/3) (2/1) (2/2, 2/3) (3/...)

Here six classes are formed from the original nine classes of the 3 × 3
classification represented by x_8 and x_9.
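
Groupings of types 3 and 4 amount to a look-up from the old class, or pair of classes, to the new class number. A minimal sketch follows; the function name `regroup` is hypothetical, and the abbreviated group (3/...) is written out in full as an assumption about what the abbreviation means.

```python
def regroup(groups):
    """Build a look-up from each old class (or pair of classes) to its new class."""
    lookup = {}
    for new_class, members in enumerate(groups, start=1):
        for old in members:
            if old in lookup:
                raise ValueError(f"class {old!r} appears in more than one group")
            lookup[old] = new_class
    return lookup


# Type 3: eight classes grouped into three.
g_three = regroup([(1, 2, 4), (3, 5), (6, 7, 8)])

# Type 3 used to change the order of a classification.
g_order = regroup([(1,), (3,), (2,), (4,)])

# Type 4: six classes formed from the nine cells of a 3 x 3 classification.
g_pairs = regroup([
    [(1, 1)], [(1, 2)], [(1, 3)], [(2, 1)],
    [(2, 2), (2, 3)],
    [(3, 1), (3, 2), (3, 3)],     # our expansion of (3/...)
])

new_class = g_pairs[(2, 3)]       # a unit with first variate 2, second variate 3
```

The check for repeated classes guards against a specification in which the same original class is listed in two groups.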

The tabulations can be controlled in a similar manner, by specification of 
the classifications (in terms of variates) which are to be used for a particular 
table, and the variate to be tabulated. Thus the tabulation instruction 

    T_1 = x_1 : x_{22}, x_4

can indicate that the two-way table of x_1, with classification by x_{22} and x_4, is
to be formed, and that this table is to be numbered 1 for future reference.

Frequency tables can be similarly dealt with by denoting the card count,
unity, by c. Thus

    T_4 = c : x_{22}, x_4

will indicate that a two-way frequency table, with classification by x_{22} and x_4, is
required.

Where one of the classifications of a frequency table is a dichotomy, i.e. 
contains two classes only, and the marginal tables over this dichotomy are 
common to a number of tables, there is no need to form the two sets of totals 
for the dichotomy. One set, and the table of marginal totals, will obviously 
suffice. Instructions for an operation of this kind may be written: 

    T_5 = c(x_6 = 1) : x_{22}, x_4        (11.9.a)

    T_6 = c(x_6 ≠ 0) : x_{22}, x_4        (11.9.b)

The first indicates that a two-way table of the frequencies of the units for
which x_6 = 1 is to be constructed, the second a similar table of the frequencies
of units in which x_6 is not zero. If this facility is available considerable storage
space may be saved.


In many surveys data are collected on a number of variates, all of which 
require tabulation in a similar manner. Frequently also, a number of different 
classifications are required for each of a set of variates. To avoid writing a 
multiplicity of tabulation instructions a combined instruction of the form 

    T_1 = x_1 . x_2 . x_3 : x_4, x_5 . x_4, x_6 . x_5, x_6        (11.9.c)

may be recognized. This denotes the nine two-way tabulations, with
classifications given by all pairs of x_4, x_5 and x_6, of the variates x_1, x_2
and x_3. Similarly

    T_2 = c : x_4, x_5 . x_4, x_6 . x_5, x_6        (11.9.d)

will give the corresponding tabulations (three in number) of numbers of units.
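
The expansion of a combined instruction into its separate tabulations can be sketched as follows. The data are invented and the function `tabulate` is hypothetical; tables are held as Python dictionaries keyed by cell, not in numbered store locations, and the unit count c is represented by `None`.

```python
from collections import defaultdict
from itertools import combinations, product


def tabulate(units, variates, classifications):
    """One table per (variate, classification); variate None means the count c."""
    tables = {key: defaultdict(float) for key in product(variates, classifications)}
    for unit in units:
        for (variate, classes), table in tables.items():
            cell = tuple(unit[name] for name in classes)
            table[cell] += 1 if variate is None else unit[variate]
    return tables


# Invented units: three quantitative variates and three classifications.
units = [
    {"x1": 2.0, "x2": 5.0, "x3": 1.0, "x4": 1, "x5": 2, "x6": 1},
    {"x1": 4.0, "x2": 7.0, "x3": 3.0, "x4": 1, "x5": 2, "x6": 2},
    {"x1": 6.0, "x2": 9.0, "x3": 5.0, "x4": 2, "x5": 1, "x6": 2},
]

pairs = list(combinations(["x4", "x5", "x6"], 2))     # all pairs of classifications
T1 = tabulate(units, ["x1", "x2", "x3"], pairs)       # nine tables of totals
T2 = tabulate(units, [None], pairs)                   # three tables of counts
```

Since each table is keyed by its variate and its classifications, the count table belonging to any table of totals can be found from the classifications alone.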

As mentioned previously, it appears best to form the marginal totals of the 
various tables by summation of the cell totals at the end of the tabulation. The 
marginal totals of each table can be stored with the body of the table, using an 
address system of the type given in the last section. If the data for all units 
are complete there will then of course be considerable duplication of marginal 
totals, but these can be used to provide checks when these are considered 
necessary. This procedure also facilitates the handling of incomplete data. No 
specific instructions will normally be required for the formation of marginal 
totals, it being understood that these are formed automatically when the 
tabulations are completed. 

In certain cases instructions for the deletion of certain classes of one or more 
classifications of a table, with the computation of fresh marginal totals, may be 
of value. This provides one method of dealing with incomplete data. If the 
results can be stored on magnetic tape, and amalgamation of certain classes is 
also provided for, condensation of tables can be made without retabulation 
after a study of a preliminary print of the results. 

Further instructions, which may be termed table derivation instructions, 
will be required for the formation of derived tables, such as the table of mean 
yields in the example of the last section. These instructions can be in ordinary 
mathematical notation, with the understanding that the operations indicated 
are to be performed successively on all the cells and margins to produce the 
values for the cells of the new table. Thus if T_1 and T_2 denote the tables of
total yields and numbers of fields, the instruction for the calculation of the table
of mean yields may be written:

    T_3 = T_1/T_2

If instructions for multiple tabulations are admitted, in the form (11.9.c)
and (11.9.d) given above, the instruction

    T_3 = T_1/T_2

will be interpreted to mean the execution of the operation on all pairs of
corresponding tables. Since in this case T_1 represents nine tables, and T_2 three
only, the machine will have to determine, from the formation instructions for
T_1 and T_2, which of the three T_2 tables corresponds to each of the T_1 tables.

In addition to the computation instructions outlined above, various ancillary
instructions are required, namely:

(a) Read-in instructions, which tell the machine how to interpret the 
information on the tape or punched cards, e.g. which columns of the 
card, or of successive cards if there is more than one card per unit, 
correspond to particular variates.

(b) With fixed-point working, instructions on the scaling of the variates, so 
as to avoid overflow and undue loss of accuracy. 

(c) Instructions for the control of the print format and numbers of decimal 
places to be printed. 

Instructions which are presented to a computer must of course contain only 
those symbols which are available on the tape or card punch. Thus only capital 
letters are normally available, and all letters have therefore to be punched as 
capitals. The available mathematical symbols and punctuation marks also vary 
somewhat from machine to machine. 

11.10 Extension for variable sampling fraction 

In stratified samples with variable sampling fraction the raising of the strata 
totals can be carried out after tabulation, provided that all tables contain a 
classification by strata. Some or all of these tables may, of course, have further
classifications, in addition to the strata classification. The strata classification 
may itself be multiway, as when different sampling fractions are used for 
size-group strata in different districts. 

The raising factors will themselves constitute a table with classification by
strata. If this table is denoted by T_g, the operation of raising may be denoted
symbolically by

    T_2 = T_1 × T_g

In this operation, if for example A and B are stratification classifications, and
C and D are additional classifications in table T_1, the cell abcd of T_1 will be
multiplied by cell ab of T_g to give cell abcd of T_2. If the unraised table is not
required further the raised table may be written in the same store locations as
the original table, in which case the instructions will be written

    T_1 = T_1 × T_g

This type of operation, in which one value or set of values is replaced by another, 
is common in electronic computation, since considerable storage space is often 
saved thereby. 
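
The raising operation can be illustrated with tables held as dictionaries. The sketch assumes, for convenience only, that the stratification classifications come first in each cell key; the function name and the figures are invented.

```python
def raise_table(table, factors, n_strata_keys):
    """Multiply each cell by the raising factor of the stratum in which it lies.

    The first `n_strata_keys` components of a cell key identify the stratum.
    """
    return {cell: value * factors[cell[:n_strata_keys]]
            for cell, value in table.items()}


# Cells are keyed (a, b, c, d); A and B are the stratification classifications.
T1 = {(1, 1, 1, 2): 10.0, (1, 2, 1, 1): 4.0}
Tg = {(1, 1): 20.0, (1, 2): 5.0}

T2 = raise_table(T1, Tg, 2)          # T_2 = T_1 x T_g, leaving T_1 untouched

# Overwriting form, T_1 = T_1 x T_g, which saves storage:
T1_in_place = dict(T1)
for cell in T1_in_place:
    T1_in_place[cell] *= Tg[cell[:2]]
```

Only body cells are raised here; the marginal totals are formed afterwards from the raised cells.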

The operation of raising must be performed before the formation of the 
marginal totals, and provision can be made for it to be given as a general 
instruction covering all tables. 

If the raising factors are unknown at the start of the analysis, as for example
may occur if there are incomplete returns, these can be computed by the
machine provided it is furnished with a table T_N of the number of units in each
stratum of the population. If general raising factors for all tables are required,
a table of the numbers of units in the various strata can be formed, e.g.

    T_1 = c : x_1, x_2

where x_1 and x_2 are the strata classification variates. We then have

    T_g = T_N/T_1

Alternatively, if the numbers of units in the various strata are available as
marginal totals of some table T_2 which contains additional classifications, we
may write the instruction as

    T_g(x_1, x_2) = T_N/T_2

The specification of the classifications required for T_g is here necessary, since
with the conventions adopted above, the table of T_g would otherwise contain
all the classifications of T_2.

If certain items of information are missing for some of the units, and it is 
desired to allow for this by the adjustment of the sampling fractions, there will 
be a separate table of numbers of units for each variate, and this must be used 
to compute the corresponding table of raising factors for that variate. This 
again can be taken care of by a general instruction if required. 

The above procedure requires that all the data be tabulated by strata. If 
the majority of the information is not required by strata this leads to a multi- 
plicity of tables that have to be subsequently raised and then condensed for 
printing. An alternative procedure is to raise each item when forming the 
tables of totals, i.e. to sum the quantities gx and g instead of x and 1. Since 
electronic computers can multiply very rapidly the time taken may not be greatly 
increased; indeed in certain cases time may be saved through the reduction in
storage requirements for the tables. It is of course necessary that the raising 
factors be known in advance of the tabulations. The tabulation instructions 
given above can easily be extended to take care of this type of operation, e.g. by 
using R instead of T to denote a table for which raising item by item is required. 
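
Raising item by item may be sketched as follows. The layout of a unit as (stratum, class, x) and the function name are assumptions made for the illustration; the essential point is that gx and g are accumulated in place of x and 1.

```python
from collections import defaultdict


def raised_tables(units, g):
    """Accumulate raised totals (sum of g*x) and raised counts (sum of g).

    `units` is a sequence of (stratum, class, x); `g` maps stratum to its
    raising factor, which must be known before tabulation starts.
    """
    totals = defaultdict(float)
    counts = defaultdict(float)
    for stratum, cls, x in units:
        totals[cls] += g[stratum] * x
        counts[cls] += g[stratum]
    return totals, counts


units = [(1, "a", 2.0), (1, "b", 3.0), (2, "a", 5.0)]   # invented data
g = {1: 10.0, 2: 4.0}
totals, counts = raised_tables(units, g)
```

The resulting tables carry no stratum classification, which is the source of the saving in storage mentioned above.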

In certain types of survey, e.g. the survey of Fertilizer Practice (Example 
6.19), the sampling fraction varies from unit to unit, being determined in part 
from information recorded for each unit. In such cases the raising factor g,
or the part that is variable from unit to unit (in this example g"), can be
computed as a derived variate by variate specification of the type already given.

11.11 Group information 

Surveys with information on groups of units, such as household surveys in 
which certain items of information are recorded for households and other items 
for individuals, present special problems. If the data are recorded on cards, 
the household data will commonly be recorded on one set of cards, with one 
card for each household, and the data relating to individuals will be recorded 
on another set of cards, with one card for each individual. 

The procedure for analysis of such surveys by tabulating machinery has 
been discussed in Section 5.18. With an electronic computer the household
and individual cards can be read in together, each set of individual cards being
preceded by the corresponding household card. The information for all the 
individuals of a household can be summarised card by card by the methods 
already outlined. It will be noted that there is no need to gang-punch household 
information on the cards for individuals. 

To deal with analyses of this kind a general programme of the type outlined 
above will require extension in the following ways:

(a) There must be recognition of the two types of card, so that the informa- 
tion from each type can be appropriately assembled and stored. Two 
separate sets of assembly and storage instructions will be required. 

(b) Any derived household variates which are required for classification in 
tables relating to individuals must be computed from the data on the 
household card before the corresponding individual cards are read in. 

(c) After all the individual cards for a household are read in, the remaining 
household characteristics, some of which may involve data from 
individual cards, must be computed. The necessary entries can then be 
made in the tables relating to households. Thus if a table of frequency 
of households, classified by numbers of individuals in the household 
and numbers of adult males in the household, is required, a count of 
the number of individuals and a further count of the number of adult 
males must be carried; to determine the number of adult males the 
information on age and sex may have to be examined for each individual. 
These two counts must be cleared before the data for the next household 
are read in. The derived variate specifications already outlined are 
adequate for counting operations of this kind. 

(d) Since the number of individuals in a household is variable, and the card 
of the last individual will not normally be specially coded, the next 
household card has to be read in before the operations included in (c). 
The information on this card must therefore be stored temporarily in 
some additional store locations. This information can then be transferred 
to the store locations relating to the current household on completion
of the computations for the previous household. 
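
The handling of household and individual cards described in (a)-(d) can be sketched as follows. Everything specific here is an assumption for illustration: the card layout, the field names `tenure`, `sex` and `age`, and the age at which an individual counts as adult.

```python
from collections import Counter

ADULT_AGE = 21   # hypothetical threshold; the text does not define "adult"


def tabulate_households(cards):
    """cards: sequence of ('H', data) and ('I', data), each household card
    being followed by the cards of its individuals."""
    individuals = Counter()   # individuals by (household tenure, sex)
    households = Counter()    # households by (no. of individuals, no. of adult males)
    current = None
    n_persons = n_adult_males = 0
    for kind, data in cards:
        if kind == "H":
            # (d) The next household card has been read before the previous
            # household is completed; `data` acts as the temporary store.
            if current is not None:
                households[(n_persons, n_adult_males)] += 1   # (c)
            current = data
            n_persons = n_adult_males = 0                     # clear the counts
        else:
            n_persons += 1
            if data["sex"] == "M" and data["age"] >= ADULT_AGE:
                n_adult_males += 1
            # (b) Household information is available for tables of individuals.
            individuals[(current["tenure"], data["sex"])] += 1
    if current is not None:                                   # last household
        households[(n_persons, n_adult_males)] += 1
    return individuals, households


cards = [
    ("H", {"tenure": 1}),
    ("I", {"sex": "M", "age": 40}),
    ("I", {"sex": "F", "age": 38}),
    ("I", {"sex": "M", "age": 10}),
    ("H", {"tenure": 2}),
    ("I", {"sex": "F", "age": 70}),
]
individuals, households = tabulate_households(cards)
```

The sketch assumes that the first card is a household card; an individual card with no preceding household card would have to be rejected in practice.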

Extensions (b) and (c) will require amplification of the computation
instructions already given. The requirements are as follows: 

(1) Indication must be given in the specifications of derived variates of 
whether they are to be computed before the first individual card of a 
household is read in, or after each individual card, or after the last 
individual card of the household. 

(2) Similar indication must be given in the tabulation instructions of whether 
entry is to be made after each individual card (tables relating to 
individuals) or only after the last individual card of a household (tables 
relating to households).

Provisions of the above type will serve for the analysis of two-stage samples
in which the first-stage units are distinguished by information carried on
separate cards. They will also enable overflow information carried on trailer
cards (Section 5.17) to be handled. If the value of a variate recorded on the
main card is placed in x_1, and the value on each trailer is placed in x_2, x_2 being
added to x_1 and the result placed in x_1 after each trailer, x_1 will contain the
correct total value after the last trailer.

Two-stage samples in which the first-stage units are distinguished by change 
of first-stage reference number, and for which tabulation of the individual 
first-stage units is not required, can be similarly dealt with. In this case 
recognition of the two types of card ((a) above) is replaced by recognition of
change of first-stage reference number. 

In some cases information on the individual members of a group is carried 
in parallel fields on the same card, possibly with additional trailer cards for 
groups for which the number of individuals exceeds the number of fields 
provided. It is then necessary to assemble the information from the separate 
fields in the same set of tables. If, for example, there are three fields, and
variates x_1, x_2, x_3 are assigned to some quantitative item from these three fields,
variates x_4, x_5, x_6 being similarly assigned to a corresponding qualitative item,
and x_7 is a qualitative group variate, tabulation of the quantitative item, classified
by the two qualitative items, will be effected by the tabulation instructions

    T_1 = x_1 : x_4, x_7
    T_1 = x_2 : x_5, x_7
    T_1 = x_3 : x_6, x_7

The identity of the table numbers indicates that all three entries are to be made 
in the same table. An associated count will, of course, be required. Trailers 
will be accommodated by making provision for the appropriate variate speci- 
fications and tabulation instructions to be obeyed after each card, whether main 
or trailer. 

11.12 Ratio and regression estimates 

The above general programme can be used without further extension to 
produce ratio and regression estimates. 

When a common value r of the ratio is used, this is given by the ratio of
the sample totals of y and x, or by the ratio of the raised totals if the sampling
fraction is variable. These totals will normally be available as the grand totals
of tabulations, T_1 and T_2 say, of x and y, and we may write

    T_3(.) = T_2/T_1

where the (.) indicates that T_3 has no classification, i.e. represents a single
value only, in this case r.

Similarly when the ratio is permitted to assume different values for the
different strata, the table of r_i will be given by the ratios of the strata totals.
If x_1 represents the stratum classification we may write

    T_3(x_1) = T_2/T_1

If T_4 represents a table of population totals of x, for classifications x_1 and
x_2, then the instruction

    T_5 = T_3 × T_4

generates the corresponding table of estimated population totals of y, with the
proviso that if the ratio varies with strata an additional instruction must be given
for computing fresh marginal totals over strata. The values of table T_4 must
of course be supplied to the machine.
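
The two forms of ratio estimate may be compared in a short sketch. The function name and the figures are invented; tables classified by strata are held as dictionaries keyed by stratum.

```python
def ratio_estimates(Sx, Sy, X):
    """Sx, Sy: sample totals of x and y by stratum; X: population totals of x.

    Returns the estimated population total of y using (i) a common ratio and
    (ii) a separate ratio for each stratum.
    """
    r_common = sum(Sy.values()) / sum(Sx.values())        # T_3(.) = T_2/T_1
    r_strata = {s: Sy[s] / Sx[s] for s in Sx}             # T_3(x_1) = T_2/T_1
    Y_common = r_common * sum(X.values())
    # With separate ratios the total over strata is formed afresh after the
    # multiplication T_5 = T_3 x T_4.
    Y_separate = sum(r_strata[s] * X[s] for s in X)
    return Y_common, Y_separate


Sx = {1: 10.0, 2: 20.0}
Sy = {1: 5.0, 2: 30.0}
X = {1: 100.0, 2: 400.0}
Y_common, Y_separate = ratio_estimates(Sx, Sy, X)
```

The need for a fresh marginal total in the second case arises because the sum of the stratum estimates is not in general equal to a single ratio applied to the total of X.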

To obtain regression estimates, with estimation of the regression coefficient
b from the sample, the values of S(x²), S(xy) and S(y²) are required, either for
the whole sample or for the separate strata. These can be obtained by specifying
derived variates x², xy and y², and forming the required tables of their totals.
The instruction for computing the regression coefficient, or a table of its values
if assumed different for the different strata, can then be provided by writing
the appropriate formula from Section 6.13, 6.14 or 6.15 in terms of the
relevant T's, remembering that S(x − x̄)² = S(x²) − x̄ S(x), etc. The instructions
for computing the final tables of means and totals can be similarly provided.
To simplify the writing of instructions of this kind, and the similar instructions
required for the computation of sampling errors (Section 11.14), comprehensive
autocode instructions embodying the more commonly needed functions may
be provided.
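
The computation of the regression coefficient from tabulated totals can be sketched as below. The sketch assumes the usual form of the regression estimate of a mean, ȳ + b(X̄ − x̄), for a single stratum or the whole sample; the formulae of Sections 6.13-6.15 should be substituted where they differ. The function names are hypothetical.

```python
def regression_coefficient(n, Sx, Sy, Sxx, Sxy):
    """b from the totals S(x), S(y), S(x^2), S(xy) of n units."""
    x_bar = Sx / n
    # S(x - x_bar)(y - y_bar) = S(xy) - x_bar * S(y)
    # S(x - x_bar)^2          = S(x^2) - x_bar * S(x)
    return (Sxy - x_bar * Sy) / (Sxx - x_bar * Sx)


def regression_estimate_of_mean(n, Sx, Sy, Sxx, Sxy, X_mean):
    """Estimated population mean of y, given the population mean of x."""
    b = regression_coefficient(n, Sx, Sy, Sxx, Sxy)
    return Sy / n + b * (X_mean - Sx / n)


# Invented sample: x = 1, 2, 3, 4 and y = 2, 4, 6, 8.
n, Sx, Sy, Sxx, Sxy = 4, 10.0, 20.0, 30.0, 60.0
b = regression_coefficient(n, Sx, Sy, Sxx, Sxy)
y_est = regression_estimate_of_mean(n, Sx, Sy, Sxx, Sxy, X_mean=3.0)
```

Only the five totals are needed, so the individual values need not be retained once the tabulation is complete.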

11.13 Two-phase sampling and sampling on successive occasions 

In two-phase sampling entries must only be made in the tables of the 
second-phase variates for those units which have second-phase information. In 
addition, if ratio or regression estimates based on first-phase information are 
required there must be subsidiary tabulations of some of the first-phase variates 
covering only those units which have second-phase information, together with 
tabulations of the ratios or sums of squares and products of the first- and 
second-phase variates. These subsidiary tables and the tables of second-phase 
variates may be distinguished by the designation P. 

The machine must be able to recognise which units have second-phase
information. This can be done by testing some variate, x_1 say, which is zero
if, and only if, there is no second-phase information. If x_1 = 0, entries are
made only in tables T; if not, entries are made in tables P also.

In sampling on successive occasions, with estimates of the type given in 
Sections 6.21 and 6.22, similar subsidiary tabulations relating to the units that 
are common to the two occasions will be required. If the data of the previous 
occasion have already been analysed on the computer, it may be presumed, if 
magnetic tape is available and the survey is a large one, that these data will have 
been stored on the tape, and will not have to be read in afresh from the cards. 
As the data for each new unit are read in, the machine will have to test whether
the unit was also sampled on the previous occasion. For this purpose some
common reference system will be required. In order that the test may be made 
expeditiously it is important that the units on all occasions are read in and 
stored in the order of the reference numbers. If this were not done, the machine 
would have to search through the whole of the data of the previous occasion 
(or at least a batch of such data) to see if there were a matching reference 
number. 
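
The advantage of keeping both sets of data in order of reference number can be seen from the following sketch, in which a single pass through the previous occasion suffices. The function name and data are invented, and reference numbers are assumed to be unique within each occasion.

```python
def match_occasions(previous, current):
    """Pair each current unit with its record on the previous occasion, if any.

    Both arguments are lists of (reference_number, value) in ascending order
    of reference number. Returns (reference, previous_value or None, value).
    """
    matched = []
    i = 0
    for ref, value in current:
        # Move forward through the previous occasion; no record is examined twice.
        while i < len(previous) and previous[i][0] < ref:
            i += 1
        if i < len(previous) and previous[i][0] == ref:
            matched.append((ref, previous[i][1], value))
        else:
            matched.append((ref, None, value))
    return matched


previous = [(1, 10), (3, 30), (4, 40)]
current = [(2, 21), (3, 31), (4, 41), (7, 71)]
pairs = match_occasions(previous, current)
```

With unordered data each new unit would require a search of the whole of the previous occasion, which is the situation the text warns against.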

In units common to the two occasions the information relating to the previous 
occasion will also have to be retrieved from the magnetic tape, and transferred 
to some appropriate variate locations in the working store. The subsidiary 
tabulations can then be made unit by unit in the same manner as for two-phase 
sampling. At the end of the tabulation the final adjusted estimates can be 
computed by a set of table derivation instructions embodying the formulae 
of Section 6.21 or 6.22. 

If the data for the previous occasion have not been stored on magnetic tape 
they will have to be read in afresh from the original cards. In this case these 
cards can be interspersed with the cards of the current occasion by a preliminary 
sort, so that the information for the previous occasion is available conjointly 
with that for the current occasion unit by unit. 

With magnetic tape storage the values of the required derived variates for 
the previous occasion can be stored on the tape together with the basic data. 
If the data are read in afresh recomputation will be necessary. 

11.14 Computation of sampling errors 

A general programme of the type outlined above can be used for the 
computation of sampling errors. The required sums of squares and products 
can be computed by the method described in Section 11.12, and the calculation
of the standard errors can be effected by the appropriate set of table derivation 
instructions. 

As an example we may consider the computation of the standard error of 
the estimate Y of the population total derived from a ratio estimate in a stratified 
sample with uniform sampling fraction. 

Take variate x_1 to represent the strata classification and assume the following
numbering of tables:

From the tabulation      Derived tables        Prior information
T_1   n                  T_7   r               T_12  f
T_2   Sx                 T_8   P               T_13  t
T_3   Sy                 T_9   Q               T_14  X
T_4   Sx^2               T_10  s^2
T_5   Sxy                T_11  S.E.(Y)
T_6   Sy^2

All these tables are assumed to be classified by x_1 except T_11 and T_12, T_14,
which represent single values.
The instruction

M(T_a : x_s)

will be taken to denote the recomputation of the marginal means over variate x_s.
The formulae are given in Sections 7.8 and 7.9. If the ratio is permitted
to take different values for the different strata the required instructions are as
follows:

T_8 = ...
T_9 = T_6 - 2T_7 × T_5 + T_7 × T_7 × T_4
M(T_9 : x_1)
T_11(·) = T_14 Sqrt((1 - T_12) × (T_1 × T_10)/(T_2 × T_2))

If the ratio is assumed to be the same for all strata the first two instructions
are replaced by

T_9 = (T_6 - T_3 × T_3/T_1) - 2T_7 × (T_5 - T_2 × T_3/T_1)
      + T_7 × T_7 × (T_4 - T_2 × T_2/T_1)
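As an illustration of the arithmetic involved, the following Python sketch computes the standard error for the case of a ratio common to all strata. It is a reading of the standard formula for a stratified sample with uniform sampling fraction, not a transcription of the table instructions; the function name `ratio_total_se` and the sample figures are hypothetical.

```python
from math import sqrt


def ratio_total_se(strata, X, f):
    """Estimate Y = X * (Sy / Sx) and its standard error, common ratio.

    strata: list of lists of (x, y) pairs, one inner list per stratum
            (each stratum needs at least two units).
    X: known population total of x (prior information).
    f: uniform sampling fraction.
    """
    Sx = sum(x for s in strata for x, _ in s)
    Sy = sum(y for s in strata for _, y in s)
    r = Sy / Sx
    weighted = 0.0  # accumulates n_i * s_i^2 over strata
    for s in strata:
        n = len(s)
        sx = sum(x for x, _ in s)
        sy = sum(y for _, y in s)
        sxx = sum(x * x for x, _ in s)
        sxy = sum(x * y for x, y in s)
        syy = sum(y * y for _, y in s)
        # Sum of squares of (y - r*x) about its stratum mean, built from
        # the same sums of squares and products as the tabulation gives.
        q = ((syy - sy * sy / n)
             - 2 * r * (sxy - sx * sy / n)
             + r * r * (sxx - sx * sx / n))
        weighted += n * q / (n - 1)
    weighted = max(weighted, 0.0)  # guard against rounding below zero
    return X * r, X * sqrt((1 - f) * weighted) / Sx


strata = [[(1.0, 1.0), (2.0, 3.0)], [(1.0, 2.0), (3.0, 4.0)]]
estimate, se = ratio_total_se(strata, X=70.0, f=0.1)
```

If the ratio is allowed to differ between strata, the corrections for the stratum means vanish, since the deviations from each stratum's own ratio line already sum to zero within the stratum.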

In the case of multistage sampling the computation of the first-stage sampling 
errors requires the evaluation of sums of squares and products of the totals, 
etc. appertaining to each first-stage sampling unit. If there is tabulation by 
first-stage sampling units these can be obtained directly from the tabulation 
totals. If, however, such tabulations are not required, and the number of 
first-stage units is large, it will be better to build up the required sums of 
squares and products in the course of the tabulation. This can be done by the 
procedure outlined for dealing with group information (Section 11.11), treating
each first-stage sampling unit as a group. The cards must of course be grouped 
by first-stage units. Change from one group to the next can be detected by 
comparing the current value of the variate giving the identification of the 
first-stage sampling unit with the preceding value. 
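The detection of a change of group can be sketched as follows in Python. The names are hypothetical, and the sketch assumes only that the records arrive grouped by first-stage unit; it accumulates the number of units, the sum of their totals and the sum of squares of their totals without holding a table by first-stage units.

```python
def first_stage_sums(records):
    """Return (count, sum, sum of squares) of first-stage unit totals.

    records: iterable of (unit_id, value), already grouped by unit_id.
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    previous_id = None
    running = 0.0
    for unit_id, value in records:
        if previous_id is not None and unit_id != previous_id:
            # Change of group: close the unit just completed.
            count += 1
            total += running
            total_sq += running * running
            running = 0.0
        running += value
        previous_id = unit_id
    if previous_id is not None:
        # Close the final unit.
        count += 1
        total += running
        total_sq += running * running
    return count, total, total_sq


records = [("A", 2.0), ("A", 3.0), ("B", 4.0), ("C", 1.0), ("C", 1.0)]
k, s, ss = first_stage_sums(records)
```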

Even with electronic computation there will rarely be any advantage in 
computing all the sampling errors associated with the estimates of a large-scale 
survey. All that are normally required are the errors of a few key estimates, 
from which the general adequacy of the sampling can be judged. 

11.15 Incomplete data 

Incomplete records are always troublesome, and if only a few units have 
incomplete information it may be best to reject these units entirely. When, 
however, data are lacking on a few of the items for a substantial proportion of 
the units, complete rejection of these units will considerably reduce accuracy 
on the other items and may lead to biased results. Provision for the tabulation 
of incomplete data must therefore be made. To effect this an impossible value, 
which can be tested for when required, can be inserted as the card is read in. 
The largest possible negative number will serve this purpose. In the case of 
qualitative variates any negative number would serve, since such variates have
been defined so as always to be positive, but for certain purposes it is convenient 
to have the same convention for both quantitative and qualitative variates. 

The treatment of missing values in the formation of derived variates 
(Section 11.9) will then be subject to the following rules. In specifications of 
type 1, if any variate entering into the function is missing, the derived variate 
is coded as missing. In type 2 any value outside the permitted range may be 
coded as missing. In type 3 the " missing " class will be preserved, without 
mention in the specification. In type 4 if either of the variates is missing the 
derived variate is coded as missing. 
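A minimal Python sketch of the first two rules is given below. The sentinel value assumes a 40-digit binary word, and the helper names `derived` and `range_checked` are hypothetical, chosen only for illustration.

```python
MISSING = -(2 ** 39)  # assumed: most negative value of a 40-bit word


def is_missing(value):
    return value == MISSING


def derived(function, *variates):
    """Type 1 rule: the result is missing if any input is missing."""
    if any(is_missing(v) for v in variates):
        return MISSING
    return function(*variates)


def range_checked(value, low, high):
    """Type 2 rule: values outside the permitted range become missing."""
    if is_missing(value) or not (low <= value <= high):
        return MISSING
    return value


def ratio(produce, acres):
    return produce / acres


yield_per_acre = derived(ratio, 120.0, 40.0)
unknown_yield = derived(ratio, MISSING, 40.0)
bad_age = range_checked(140, 0, 120)
```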

In the tabulations missing values of a variate used for classification can be 
dealt with by providing an extra (" unclassified ") class in the variate. In order 
to be accommodated in the table address system given in Section 11.8 the 
" unclassified " class must be assigned a class number one greater than the 
number of classes in that classification. When missing values are admitted it is 
simplest to provide such an extra class for all classification variates, but this 
may require considerable additional storage space. To avoid this, arrangements 
may be made to specify which classifications require such a class. If
unclassifiable material is to be eliminated from the final printed tables condensation of
the tables, by the omission of the " unclassified " classes, will be required. 

An alternative is to omit the unclassified material from the table, or if a 
total check at the end of the tabulation is required, to place it all in a single 
additional cell. The margins of such a table will of course exclude material which 
is unclassifiable because of missing values of the other classification variates. 

Missing values of the variate which is being tabulated are more troublesome, 
since no entry must be made in the table or in the associated table of total 
frequencies. Consequently when missing values are admitted every table of 
the totals of a quantitative variate must have its own table of total frequencies. 
This may be specified in the instructions by writing, for example, 

T 2 = <?i ^2 * ^3 x to x & ' X &> X 6 * X 5i X & 

instead of instruction (11.9.d) above. The same rules operate when constructing
the frequencies of one half of a dichotomy; the instructions (11.9.a) and
(11.9.b) above must have associated with them the instruction

J" 7 = c & ', # 2 2'* r 4 

it being understood that (x & ^0) in instruction (11.9.b) excludes missing 
values as well as 0. 

In ratio and regression estimates, and also in all estimates of error involving 
sums of products, all the variates of a unit must be treated as missing in the 
relevant tables if any one of them is missing. 
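The two tabulation rules (an extra "unclassified" class numbered one greater than the number of classes, and a separate table of frequencies for each table of totals) can be illustrated by a short Python sketch. The use of `None` as the missing marker and the function name `tabulate` are assumptions made for the illustration only.

```python
MISSING = None  # hypothetical missing marker for this sketch


def tabulate(units, n_classes):
    """Totals and frequencies of y by class, with an 'unclassified' class.

    units: iterable of (class_number, y), class numbers 1..n_classes.
    Index n_classes + 1 holds units whose classification is missing.
    """
    totals = [0.0] * (n_classes + 2)  # index 0 unused
    freqs = [0] * (n_classes + 2)
    for cls, y in units:
        if y is MISSING:
            # No entry in the totals or in their own frequency table.
            continue
        cell = n_classes + 1 if cls is MISSING else cls
        totals[cell] += y
        freqs[cell] += 1
    return totals, freqs


units = [(1, 10.0), (2, 4.0), (MISSING, 6.0), (1, MISSING), (2, 8.0)]
totals, freqs = tabulate(units, 2)
class_means = [t / n for t, n in zip(totals[1:], freqs[1:]) if n > 0]
```

Because the unit with a missing value of y is left out of both tables, the class means are based on the correct numbers of units.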

11.16 Influence of size of computer on methods of analysis 

In the above discussion we have assumed that there is sufficient storage 
space to accommodate all the tables which are required at the various stages 
of computation. In a machine with a large immediate access store the ideal
is to hold all the tables in this store. If the store is not large enough for this,
some of the tables will have to be held in the backing-up store, and the required
blocks will then have to be transferred to the working store and back again
every time an entry has to be made. This will considerably increase the
computing time, and in certain cases it may pay to carry out the analysis in two
or more runs, compiling a different set of tables at each run. Whether this is
advisable will depend on the nature of the survey and the characteristics of the
computer.

A further alternative for economizing storage space is to omit the storage 
locations required for the marginal totals while performing the tabulation. The 
tables can then be expanded to make room for the margins when the tabulation 
is complete. Storage can also be saved by packing two or more quantities in 
each store location. Thus if the machine works with 40-digit binary numbers, 
pairs of positive or negative numbers not exceeding $2^{19}$, i.e. approximately
500,000, can be stored together. The packing and unpacking is effected by 
embodying the necessary sets of orders in the programme at the required points. 
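The packing of two quantities into one 40-digit binary word can be sketched in Python as follows. The biased (offset) representation of each half is an assumption made for the illustration; the text does not say how the two halves are to be encoded.

```python
HALF_BITS = 20
BIAS = 1 << (HALF_BITS - 1)   # 2**19
MASK = (1 << HALF_BITS) - 1


def pack(a, b):
    """Pack two integers of magnitude below 2**19 into one 40-bit word."""
    if not (-BIAS < a < BIAS and -BIAS < b < BIAS):
        raise ValueError("value does not fit in a 20-bit half-word")
    return ((a + BIAS) << HALF_BITS) | (b + BIAS)


def unpack(word):
    """Recover the two integers stored by pack()."""
    return (word >> HALF_BITS) - BIAS, (word & MASK) - BIAS


word = pack(-524287, 1234)
pair = unpack(word)
```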

In machines which have only a small drum store, storage limitations become 
much more serious, and several runs of the data may be inevitable. In such 
machines, also, limitations of storage capacity will prevent the construction of 
any comprehensive general programme. The most that is likely to be profitable
is to construct a set of sub-routines embodying the basic operations outlined
above, together with a framework of orders (the connective tissue of the
programme) into which sets of orders for calculating the required functions
can be fitted.*

Even when the available computer is too small for a general programme the 
notation developed in the preceding sections will still be of value, since it can 
be employed to set out in exact form the analytical operations that are required. 
When this has been done the construction of the actual programme will be 
much easier. The same applies when a large computer for which no general 
programme has been constructed has to be used. 

A further way of economizing storage space, and also simplifying the
tabulating operations, is to sort the data by one of the main classifications before
reading it into the computer. In the National Farm Survey (Section 5.21),
for example, there was stratification by size-groups, and tabulation by
size-groups was in fact required for all variates. If, therefore, in such an analysis
the cards are sorted by size-group (as in punched-card tabulation) the tabulations 
can proceed size-group by size-group, the results being punched out (or 
transferred to magnetic tape) at the end of each size-group, and read back for 
raising, etc. when the tabulation is complete. It will be noted that with this 
procedure no classification by size-group is required at the tabulation stage. 
Since there are five size-groups the storage requirements are cut to one-fifth. 

* Since the above was written the possibility of constructing a general survey
programme for the Rothamsted computer, which is a drum machine with approximately 
3000 words of storage, has been investigated ; a programme much on the above lines 
has been found to be practicable. 
When tabulation by strata is not required, and the data are read in stratum
by stratum, raising can be performed at the end of each stratum, the raised 
totals being accumulated in a separate set of store locations. At the end of the 
tabulation these locations will contain the required tables. If unraised as well 
as raised totals are required, two additional sets of storage locations will be 
necessary. 
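Raising stratum by stratum can be sketched in Python as below. The function name and the raising factors are hypothetical; the sketch assumes the records are already in stratum order, so that each stratum total is complete before it is raised and added to the running totals.

```python
from itertools import groupby


def raised_totals(records, raising_factor):
    """Return (unraised, raised) totals of y.

    records: (stratum, y) pairs sorted by stratum.
    raising_factor: dict mapping stratum to its raising factor.
    """
    unraised = 0.0
    raised = 0.0
    for stratum, group in groupby(records, key=lambda rec: rec[0]):
        stratum_total = sum(y for _, y in group)
        # Raise once, at the end of the stratum.
        unraised += stratum_total
        raised += raising_factor[stratum] * stratum_total
    return unraised, raised


records = [(1, 2.0), (1, 3.0), (2, 10.0)]
unraised, raised = raised_totals(records, {1: 10.0, 2: 4.0})
```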

With cards, sorting or re-sorting presents no difficulty. With paper tape
the data must be sorted before punching; re-sorting data on paper tape can 
only be done by feeding the data into the machine and punching a new tape. 
The same applies to data on magnetic tape, though the operation is much 
quicker. 

Even with magnetic tape the re-sorting of any large body of data is somewhat 
troublesome and requires several transfers to fresh tape. Careful thought 
should therefore be given at the outset to the most advantageous order in which 
to present the data to the computer. 

A small computer can also be used as an adjunct to standard punched-card
equipment, for the calculation of functions of variates such as ratios and indices. 
When so used the calculated values will be punched out on a new set of cards. 
These values can then be transferred to the original cards by means of a 
reproducer punch, or alternatively part or all of the data on the original cards 
can be similarly transferred to the new cards. 

11.17 Coding of information 

With an electronic computer there is no need to adopt any elaborate coding 
for the input of information, since the data can be interpreted and coded by 
the computer as they are read in. This greatly facilitates the punching, as the 
information contained in the basic records, provided it is reasonably concise, 
can be punched in the form and order in which it is recorded. 

Letter as well as number codes can be used for punching when convenient. 
Thus in an agricultural survey involving records of crops the first two letters 
of the crop can serve as the code. If this leads to any ambiguities, as for example 
between cabbages, cauliflowers and carrots, an alternative code such as CB, 
CU, CR, can be adopted for these crops. Provided CA is not used for any 
other crop inadvertent coding of any one of these crops by the first two letters 
will then be detected, since the code will be inadmissible. 
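The detection of an inadmissible code can be illustrated by a short Python sketch. The entries for wheat and barley are hypothetical additions following the first-two-letters rule; the three C codes are those given above.

```python
CROP_CODES = {
    "WH": "wheat",         # hypothetical entry, first two letters
    "BA": "barley",        # hypothetical entry, first two letters
    "CB": "cabbages",
    "CU": "cauliflowers",
    "CR": "carrots",
}


def decode_crop(code):
    """Return the crop for a two-letter code, or raise if inadmissible."""
    try:
        return CROP_CODES[code.upper()]
    except KeyError:
        raise ValueError(f"inadmissible crop code: {code!r}") from None


crop = decode_crop("CU")
```

Since CA is deliberately left out of the table, a cabbage, cauliflower or carrot record punched by its first two letters is rejected rather than silently assigned to the wrong crop.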

In surveys for which the basic records contain a large number of blanks, 
as for example will occur on questionnaires in which a subsidiary group of 
questions has to be answered only if the answer to some main question is yes, 
punching of these blanks can be avoided entirely by designating each positive 
item of information, or a suitable selection of them. The mode of designation 
must of course be so arranged that the machine can recognise which items of 
information are being recorded. Such designation is particularly convenient 
for tape input, but can be used with card input if required. If it is so used, 
however, the advantage of being able to sort and tabulate the cards on ordinary 
punched-card equipment may be lost, unless the computer is used to interpret
the designated information and punch out a secondary set of cards. 

11.18 Preliminary editing of data 

Experience has shown that any large body of data is likely to contain errors
of recording and transcription. This is so even when the data are collected by
experienced observers, and there is rigorous checking of the transcription at all
stages.

Methods of testing for such errors have been discussed in Sections 5.20 
and 10.13. When punched-card equipment is used, however, any but the 
simplest tests require a great deal of card sorting and processing, particularly 
if the relations between more than two variates are involved. Electronic 
computers enable much more thorough and searching tests to be made, since 
complex functions can be readily computed and the required tests immediately
applied.

Although in principle the application of tests of this kind on an electronic 
computer is simple, their practical organisation presents certain problems. In 
the first place suitable test functions have to be chosen. Secondly, when the 
test functions have been decided, acceptance limits have to be determined. 
Thirdly, the action to be taken when one or more tests fail has to be determined. 

The test functions should be chosen so that all variates of any importance
which can be effectively tested are so tested, and the tests should be so arranged
that as far as possible the pattern of failures itself reveals which variate is likely
to be in error. Thus if four variates x_1, x_2, x_3, x_4 are all highly correlated, tests
of x_1 against x_2, and x_3 against x_4, will suffice to reveal a gross error in any one
of the variates, but will not determine which of the pair of variates for which
the test fails is at fault. If the additional test of x_2 against x_3 is included the
variate at fault can be identified. As a further safeguard the test of x_1 against x_4
may also be included.
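The way in which the pattern of failures points to the variate at fault can be sketched in Python. The function name is hypothetical, and the sketch assumes that only one variate of the unit is in gross error, so that the faulty variate is the one common to all the tests that fail.

```python
TESTS = [(1, 2), (2, 3), (3, 4), (1, 4)]  # pairs of variates compared


def variate_at_fault(failed):
    """Return the single variate common to all failed tests, else None.

    failed: collection of pairs from TESTS whose test failed.
    """
    if len(failed) < 2:
        # One failure alone cannot say which member of the pair is wrong.
        return None
    common = set.intersection(*(set(pair) for pair in failed))
    return common.pop() if len(common) == 1 else None


fault = variate_at_fault([(1, 2), (2, 3)])
ambiguous = variate_at_fault([(1, 2)])
```

With all four tests in use a gross error in any one variate causes exactly two failures, and the variate they share is identified.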

In certain cases acceptance limits can be determined from the nature of the 
test. Thus in a farm survey in which the acreages of the various crops are 
reported and also the total acreage of cultivated land, only a trivial discrepancy 
in the summation check will be admissible. In most cases, however, the 
acceptance limits will have to be determined from the actual data. Frequently 
also the numerical constants entering into the test function must be similarly 
determined. 

Suppose, for example, two variates x and y are highly correlated. Then all
the points (x, y) (except those in which x or y is subject to gross error) will
be close to the line

$$ y - \bar{y} = b(x - \bar{x}) $$

representing the regression of y on x. We may therefore use the deviations
from this line as the test function. If x and y are equally liable to error it is
best to use the line
intermediate between the two regression lines, y on x and x on y, given by 



The values of $\bar{y}$, $\bar{x}$ and b have to be determined from the data. When they
have been determined the deviations of all points can be calculated and their 
frequency distribution constructed. Examination of this frequency distribution 
will usually enable suitable acceptance limits to be chosen without much 
difficulty. The individual deviations have then to be tested and the necessary 
action taken when the test fails. 
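The sequence of operations (fitting the line, forming the deviations and their frequency distribution, and testing each deviation) can be sketched in Python as below. All names are hypothetical. The slope used for the intermediate line, the geometric mean of the two regression slopes, is an assumption made here for illustration and is not taken from the text.

```python
from collections import Counter
from math import copysign, sqrt


def fit_line(points, intermediate=False):
    """First passage: return (x_bar, y_bar, b) for the test line."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    if intermediate:
        # Assumed form: geometric mean of the two regression slopes.
        b = copysign(sqrt(syy / sxx), sxy)
    else:
        b = sxy / sxx  # regression of y on x
    return x_bar, y_bar, b


def deviations(points, x_bar, y_bar, b):
    """Second passage: deviation of each point from the fitted line."""
    return [(y - y_bar) - b * (x - x_bar) for x, y in points]


def frequency_distribution(devs, width):
    """Count deviations in classes of the given width, keyed by floor(d / width)."""
    return Counter(int(d // width) for d in devs)


def failures(refs, devs, limit):
    """Third passage: units whose deviation exceeds the acceptance limit."""
    return [(ref, d) for ref, d in zip(refs, devs) if abs(d) > limit]


refs = list(range(1, 11))
points = [(float(x), 2.0 * x) for x in refs]
points[4] = (5.0, 30.0)  # a gross error in y for unit 5
x_bar, y_bar, b = fit_line(points)
devs = deviations(points, x_bar, y_bar, b)
table = frequency_distribution(devs, width=5.0)
flagged = failures(refs, devs, limit=5.0)
```

In this artificial example the frequency distribution shows nine small deviations and one isolated large one, so that an acceptance limit of 5 picks out unit 5 alone.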

These operations will require three separate passages of the data through the 
computer, or if the values of the individual deviations are stored on the second 
passage, together with the reference numbers of the units, a further passage 
of these stored deviations. If magnetic tape is available this will of course be 
used for storage of the data on the first passage, and if the deviations are stored, 
for these also. 

The operations of calculating the values of $\bar{y}$, $\bar{x}$ and b, and the values
and frequency distribution of the deviations, can be dealt with by a general 
programme of the type already outlined. The storage of the deviations can 
also be handled by the general storage instructions which will in any case have 
to be provided for the storage of data on magnetic tape. New types of instruction 
will have to be provided for the tests, and for the action to be taken when they 
fail. 

The action required depends on circumstances. If further examination of 
the records showing anomalies is considered appropriate, all that is necessary 
is to instruct the computer to print a reference number by which the unit can 
be identified and an indication of the nature of the inconsistency, and possibly 
also its magnitude. If speed of analysis is essential it may be decided that the 
best course is to reject anomalous units entirely or to reject that part of the 
information which gives rise to the anomaly. In such cases a record should, 
of course, be printed of the units rejected and the reasons for their rejection; 
a subsequent post mortem examination can then be made if required. In certain 
cases it may be possible to instruct the computer to make a correction without 
further examination. 

In these last two cases the general analysis can proceed in parallel with the 
tests. The amended data should be stored afresh on the magnetic tape so that 
additional analyses can be carried out without further amendment if required. 
If amendments are made by hand and magnetic tape is used, provision must
be made for correcting the tape record before the final analysis is
undertaken.

It will be noted that errors in punching are controlled by tests of the above 
type. It is, however, rarely possible to exercise close control over all variates, 
and rigorous control of punching errors by proper verification, etc. must 
therefore be maintained. 
It is also good practice to examine extreme values of all quantitative variates
for which control of the above type is impossible, and to check that there are 
no impossible values of qualitative variates. 

11.19 Critical analysis 

In various places in this book there has been discussion of the ways in
which the crude tables of class means, percentages, etc., can be examined to 
throw light on the underlying relationships and effects of various factors. The 
computations required for such analyses, as the reader will no doubt have 
become aware, are considerably more involved than those required for the 
preparation of straightforward summary tables, and are not of the type which 
can be carried out on punched-card machines. Hitherto, therefore, desk 
calculation has been required, and the amount of computation needed has been 
such as to prevent much work of this kind from being undertaken.

Electronic computers are very suited to this type of computation, since, 
properly programmed, they are capable of carrying out elaborate analyses from 
start to finish. Once programmed for a particular type of analysis, moreover, 
the same analysis can be repeated with a minimum of trouble. Such repetition 
is frequently required in large-scale surveys both for the different quantitative 
and qualitative variates under study, and for different batches of data, e.g. 
different regions in an area survey, and different occasions in a survey repeated 
at intervals. 

One type of analysis for which the Rothamsted computer has already been
found of great value is the fitting of constants to multiway tables. A very simple
example of a two-way table of quantitative data has been given in Section 5.24.
Even in this simple case it was thought worth while (Section 5.23) to give 
approximate methods of analysis, by which the labour of fitting constants 
could be avoided. Such approximate methods, however, require thought and 
judgement, and have the attendant disadvantage that the resultant estimates 
are not the most accurate possible, and will vary according to the method used. 

A further example of a similar analysis of a four-way table of qualitative 
data, using the logit transformation, is given in Section 9.7. The method used 
in this latter example, and in the similar example of quantitative data in Section 
9.6, is only applicable to tables in which all classifications are two-fold, and 
in the case of the example on qualitative data is a first approximation only;
a second approximation requires a more elaborate procedure and more than 
doubles the amount of computation required. 

The programme written for the Rothamsted computer will fit constants to 
multiway tables with up to five classifications, with the overall limitation that 
$(P + 1)(Q + 1)(R + 1) \ldots < 256$, where P, Q, R, ... are the numbers of
classes in the various classifications. This limitation is solely due to the 
characteristics and capacity of the store (2944 words) which by modern standards 
is a small one. Either quantitative or qualitative data can be handled, and in 
the latter case the logit, probit, angular or log-log transformation can be used 
as desired. Moreover, if required, constants representing sub-classifications can
be included; thus in a three-way table constants representing the main
classifications A, B and C and the sub-classifications A × B, A × C can be
fitted. In all cases the sum of squares accounted for by the fitting process and 
the residual sum of squares are computed. 

The method followed for the quantitative case is the method of successive 
approximation given in Section 5.24. In the qualitative case this method is 
used as the second stage of a two-stage approximation process, with the ordinary 
maximum likelihood procedure forming the first stage. 
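For the quantitative case, fitting by successive approximation can be illustrated by the following Python sketch for a two-way table with unequal numbers in the cells. It alternately re-estimates the row and column constants by weighted least squares; it is offered as an illustration of the general idea and is not necessarily the precise procedure of Section 5.24 or of the Rothamsted programme. All names and figures are hypothetical.

```python
def fit_additive(values, weights, sweeps=200):
    """Fit values[i][j] ~ row[i] + col[j] by weighted least squares.

    Row and column constants are re-estimated alternately. Every row and
    every column must carry some positive weight.
    """
    n_rows, n_cols = len(values), len(values[0])
    row = [0.0] * n_rows
    col = [0.0] * n_cols
    for _ in range(sweeps):
        for i in range(n_rows):
            w = sum(weights[i])
            row[i] = sum(weights[i][j] * (values[i][j] - col[j])
                         for j in range(n_cols)) / w
        for j in range(n_cols):
            w = sum(weights[i][j] for i in range(n_rows))
            col[j] = sum(weights[i][j] * (values[i][j] - row[i])
                         for i in range(n_rows)) / w
    fitted = [[row[i] + col[j] for j in range(n_cols)] for i in range(n_rows)]
    residual_ss = sum(weights[i][j] * (values[i][j] - fitted[i][j]) ** 2
                      for i in range(n_rows) for j in range(n_cols))
    return row, col, fitted, residual_ss


values = [[11.0, 21.0, 31.0], [12.0, 22.0, 32.0]]  # exactly additive
weights = [[5, 1, 2], [1, 4, 3]]                   # unequal cell numbers
row, col, fitted, rss = fit_additive(values, weights)
```

The row and column constants are determined only up to a constant added to one set and subtracted from the other, so it is the fitted values and the residual sum of squares that should be compared between methods.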

This type of analysis has proved extremely useful in sorting out the apparent
effects of various factors in animal disease surveys. Table 11.19.a provides an
example of the type of data collected.

TABLE 11.19.a  PERCENTAGES OF COWS AFFECTED BY MILK FEVER
AND NUMBERS OF CALVINGS

                               Lactation
                1-2       3       4       5      6+   Overall

(a) Percentage affected
Jan.-April     0.42    1.45    3.15    5.57    6.07      2.67
May-July       0.17    3.08    7.37    9.43   10.94      4.17
Aug.-Sept.     0.66    4.89    9.40   10.93   13.98      4.25
Oct.-Dec.      0.35    2.93    6.07    9.61   12.19      3.50
Overall        0.41    2.75    5.58    8.06    9.38      3.46

(b) Calvings
Jan.-April     3806    2137    1744    1184    2010     10881
May-July       2352    1006     706     488     841      5393
Aug.-Sept.     3169    1003     617     366     522      5677
Oct.-Dec.      5117    1740    1252     791    1042      9942
Overall       14444    5886    4319    2829    4415     31893

The table gives the percentage of cows
affected by milk fever in a lactation and the total numbers of calvings, classified 
by season of calving and number of lactation. It is immediately apparent from 
an examination of the percentages in the body of the table, that not only does 
the lactation number have a large influence on the incidence of milk fever, but 
also that there are definite differences with seasons, the incidence being highest 
in August-September, and lowest in January-April. The difference between 
the May-July and August-September periods, however, is largely obscured in 
the marginal percentages, owing to the greater proportion of calvings in 
August-September for the early lactations. Estimates of the marginal
percentages, freed from this disturbing influence, are therefore required.

The logit transformation was used for fitting. The fitted values, their 
transformed values in terms of percentages, and the corresponding percentages 
adjusted proportionally to give the observed overall percentage are shown in
Table 11.19.b. (The need for this adjustment is obvious. Its justification 
need not concern us here.) 

TABLE 11.19.b  FITTED CONSTANTS AND CORRESPONDING PERCENTAGES

                               Percentage
Lactation       Logit     From logit   Adjusted
1-2            -2.781        0.38        0.36
3              -1.778        2.78        2.63
4              -1.390        5.85        5.54
5              -1.184        8.56        8.11
6+             -1.087       10.23        9.69

                               Percentage
Season          Logit     From logit   Adjusted
Jan.-April     -1.649        3.56        2.06
May-July       -1.329        6.54        3.78
Aug.-Sept.     -1.156        9.01        5.21
Oct.-Dec.      -1.324        6.61        3.82
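The two derived columns can be checked by a short Python sketch. It assumes the logit is defined as z = ½ log_e(p/q), which reproduces the tabulated percentages to within rounding, and it assumes that the proportional adjustment uses the marginal numbers of calvings as weights; the function names are hypothetical.

```python
from math import exp


def percentage_from_logit(z):
    """Invert z = 0.5 * ln(p / q), returning p as a percentage."""
    odds = exp(2.0 * z)
    return 100.0 * odds / (1.0 + odds)


def adjust_to_overall(percentages, numbers, overall):
    """Scale percentages so that their weighted mean equals the overall figure."""
    weighted_mean = (sum(p * n for p, n in zip(percentages, numbers))
                     / sum(numbers))
    return [p * overall / weighted_mean for p in percentages]


lactation_logits = [-2.781, -1.778, -1.390, -1.184, -1.087]
lactation_calvings = [14444, 5886, 4319, 2829, 4415]
from_logit = [percentage_from_logit(z) for z in lactation_logits]
adjusted = adjust_to_overall(from_logit, lactation_calvings, 3.46)
```

Applied to the lactation constants, this gives values agreeing with the two percentage columns of the table to within a unit or two in the second decimal place.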

It will be seen that the final marginal percentages bring out the differences
between seasons which were apparent from the examination of the body of
the table. Moreover the value of the residual sum of squares, 17.6, which
corresponds to a $\chi^2$ with 12 degrees of freedom, gives a value of P > 0.1 and
< 0.2, indicating that the table is reasonably represented by additive constants
in the logit scale.
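The probability quoted can be verified without tables by a brief Python sketch. For an even number of degrees of freedom the upper tail of the chi-square distribution has a closed form as a finite series; the function name is hypothetical.

```python
from math import exp


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with even df exceeds x), by the closed-form series."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k  # (x/2)^k / k!
        total += term
    return exp(-half) * total


p = chi2_upper_tail_even_df(17.6, 12)
```

The value of `p` is about 0.13, which lies between 0.1 and 0.2 as stated.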

An important general point to notice about this example is that the 
statistician who is responsible for the analysis of the results need no longer 
concern himself with the computational procedure of the fitting process, or the 
theory on which it is based. All this is taken care of by the electronic computer, 
and is solely the concern of the person who has to compile the programme. 

Critical analyses of this type will normally only be undertaken after the 
results of the main tabulations have been examined, since it is only then that 
it can be decided which analyses are likely to be worth while. If a number of 
such analyses have to be undertaken, a good deal of re-punching will be avoided 
if the results of the original tabulation analysis are stored in such a manner 
that the necessary tables can be retrieved when required. It is therefore good 
practice, for this as for other reasons, to store the results of the main analysis 
on magnetic tape. If magnetic tape is not available there is in principle no 
difficulty in reading back the results from the cards or paper tape used for their 
output, but the retrieval of a few tables from a large mass of results punched 
on paper tape may well be troublesome. 

11.20 Speed of analysis 

It will be apparent from what has already been written that the analysis of 
large-scale surveys, when properly organised, can be carried out much more 
expeditiously on an electronic computer than by any previously available method.
This is an immense advantage in surveys the results of which are required at 
the earliest possible moment after their completion. Even in investigational and 
research surveys, where speed is not of crucial importance, this gain in speed 
is of very real benefit, and results in a quickening of the whole tempo of research. 

Proper organisation is, however, essential. Even if a general survey analysis 
programme of the type outlined above is available the writing of the necessary
instructions takes a good deal of time; if no extensive programming aids are 
available the programming of the analysis of an elaborate survey can be a very 
lengthy process. However the programme is compiled, it must also be thoroughly 
tested to make certain that it contains no errors, and that all contingencies have 
been allowed for. 

Before a start can be made on the actual programming, decisions have to 
be taken on what tabulations are required, what preliminary editing should be 
undertaken, and a host of other details. Final decisions on many of these points 
can frequently only be arrived at after a preliminary analysis of some data, and
for this purpose, as well as for programme testing, the results of a pilot study 
or a previous similar survey should always be utilized if available. 

11.21 Arithmetical accuracy 

Questions are often asked regarding the accuracy of computations carried 
out on electronic computers. It is of course true that computers occasionally 
make mistakes, but the accuracy attained may be expected to be far higher than 
that attained with most other forms of computation. 

There are several reasons for this. In the first place, since a computer uses 
numbers as orders, any storage defects will be rapidly revealed by the misreading 
of orders. Such misreading normally results in entirely nonsensical results, or 
in failure to follow the programme. Failure in this respect will usually bring 
the machine to a speedy halt or cause looping, i.e. repeated traversing of the 
same loop of orders. Secondly the reliability of machines is continually being 
improved, and most large modern machines have a number of built-in checking 
devices which stop the machine or take remedial action if an error is detected. 
Thirdly the wise programmer incorporates a sufficiency of checks in his 
programme to be reasonably assured that the arithmetical operations are being 
carried out with a high degree of reliability. 

Moreover the accuracy of the final results depends not only on the accuracy 
of the computations but also on the accuracy of the data supplied to the machine. 
The more extensive editing that can be undertaken with electronic computers 
may be expected greatly to reduce the incidence of gross errors in the data 
that are actually processed. 

The most troublesome source of errors in electronic computation is, in fact, 
faults in the programme. Such faults will often produce obviously nonsensical 
results, but these may not be immediately noticed if they are embedded in a 
mass of otherwise correct material. Moreover in certain circumstances plausible 
but incorrect results may be produced, as for example when one of the variates
entering into a function is wrongly specified, and the substitute variate is highly 
correlated with the correct variate. All programmes should therefore be very 
thoroughly tested before use, particular care being taken to ensure that parts 
of the programme that are only operative in certain rare combinations of 
circumstances are covered by the tests, and that the numerical constants and 
control words that are required for a particular analysis are correctly specified. 
Faults in a programme can also cause the programme to fail entirely, and 
this may also happen only in rare circumstances. Such faults, if undetected 
in the tests, can be very troublesome, since they may only reveal themselves 
after the programme has been in use for some time, and are frequently difficult 
to locate and correct. They may thus cause serious disruption of the timetable 
of urgent jobs. 



TABLE A1 RANDOM NUMBERS

03 47 43 73 86   36 96 47 36 61   46 98 63 71 62   33 26 16 80 45
97 74 24 67 62   42 81 14 57 20   42 53 32 37 32   27 07 36 07 51
16 76 62 27 66   56 50 26 71 07   32 90 79 78 53   13 55 38 58 59
12 56 85 99 26   96 96 68 27 31   05 03 72 93 15   57 12 10 14 21
55 59 56 35 64   38 54 82 46 22   31 62 43 09 90   06 18 44 32 53
16 22 77 94 39   49 54 43 54 82   17 37 93 23 78   87 35 20 96 43
84 42 17 53 31   57 24 55 06 88   77 04 74 47 67   21 76 33 50 25
63 01 63 78 59   16 95 55 67 19   98 10 50 71 75   12 86 73 58 07
33 21 12 34 29   78 64 56 07 82   52 42 07 44 38   15 51 00 13 42
57 60 86 32 44   09 47 27 96 54   49 17 46 09 62   90 52 84 77 27
18 18 07 92 46   44 17 16 58 09   79 83 86 19 62   06 76 50 03 10
26 62 38 97 75   84 16 07 44 99   83 11 46 32 24   20 14 85 88 45
23 42 40 64 74   82 97 77 77 81   07 45 32 14 08   32 98 94 07 72
52 36 28 19 95   50 92 26 11 97   00 56 76 31 38   80 22 02 53 53
37 85 94 35 12   83 39 50 08 30   42 34 07 96 88   54 42 06 87 98
70 29 17 12 13   40 33 20 38 26   13 89 51 03 74   17 76 37 13 04
56 62 18 37 35   96 83 50 87 75   97 12 25 93 47   70 33 24 03 54
99 49 57 22 77   88 42 95 45 72   16 64 36 16 00   04 43 18 66 79
16 08 15 04 72   33 27 14 34 09   45 59 34 68 49   12 72 07 34 45
31 16 93 32 43   50 27 89 87 19   20 15 37 00 49   52 85 66 60 44
68 34 30 13 70   55 74 30 77 40   44 22 78 84 26   04 33 46 09 52
74 57 25 65 76   59 29 97 68 60   71 91 38 67 54   13 58 18 24 76
27 42 37 86 53   48 55 90 65 72   96 57 69 36 10   96 46 92 42 45
00 39 68 29 61   66 37 32 20 30   77 84 57 03 29   10 45 65 04 26
29 94 98 94 24   68 49 69 10 82   53 75 91 93 30   34 25 20 57 27
16 90 82 66 59   83 62 64 11 12   67 19 00 71 74   60 47 21 29 68
11 27 94 75 06   06 09 19 74 66   02 94 37 34 02   76 70 90 30 86
35 24 10 16 20   33 32 51 26 38   79 78 45 04 91   16 92 53 56 16
38 23 16 86 38   42 38 97 01 50   87 75 66 81 41   40 01 74 91 62


31 96 25 91 47 


96 44 33 49 13 


34 86 82 53 91 


00 52 43 48 85 


66 67 40 67 14 


64 05 71 95 86 


11 05 65 09 68 


76 83 20 37 90 


14 90 84 45 11 


75 73 88 05 90 


52 27 41 14 86 


22 98 12 22 08 


68 05 51 18 00 


33 96 02 75 19 


07 60 62 93 55 


59 33 82 43 90 


20 46 78 73 90 


97 51 40 14 02 


04 02 33 31 08 


39 54 16 49 36 


64 19 58 97 79 


15 06 15 93 20 


01 90 10 75 06 


40 78 78 89 62 


05 26 93 70 60 


22 35 85 15 13 


92 03 51 59 77 


59 56 78 06 83 


07 97 10 88 23 


09 98 42 99 64 


61 71 62 99 15 


06 51 29 16 93 


68 71 86 85 85 


54 87 66 47 54 


73 32 08 11 12 


44 95 92 63 16 


26 99 61 65 53 


58 37 78 80 70 


42 10 50 67 42 


32,17 55 85 74 


14 65 52 68 75 


87 59 36 22 41 


26 78 63 06 55 


13,08 27 01 50 



This table forms part of a larger table of random numbers given in Statistical Tables for Biological,
Agricultural and Medical Research by R. A. Fisher and F. Yates, Oliver & Boyd, Edinburgh (3rd edition,
1948), and is reproduced by kind permission of the senior author and the publishers.
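
One common way of using such a table to draw a random sample without replacement may be sketched as follows. This is an illustration under stated assumptions rather than a rule taken from this appendix: the function name select_from_table is hypothetical, the population is assumed to contain at most 100 units numbered from 1, and the pair 00 is assumed to stand for 100. Pairs are read in order; numbers exceeding the population size, and numbers already drawn, are passed over.

```python
def select_from_table(pairs, population_size, sample_size):
    """Choose distinct unit numbers 1..population_size from two-digit entries."""
    if not 1 <= population_size <= 100:
        raise ValueError("two-digit entries cover populations of 1 to 100 units")
    if sample_size > population_size:
        raise ValueError("sample cannot exceed the population")
    chosen = []
    for pair in pairs:
        number = int(pair)
        if number == 0:
            number = 100  # assumed convention: 00 is read as 100
        # Reject numbers outside the population and repeats.
        if number > population_size or number in chosen:
            continue
        chosen.append(number)
        if len(chosen) == sample_size:
            return chosen
    raise ValueError("table entries exhausted before the sample was complete")


# First row of Table A1, read from left to right.
FIRST_ROW = "03 47 43 73 86 36 96 47 36 61 46 98 63 71 62 33 26 16 80 45".split()

# Five units from a population of 50 numbered units.
sample = select_from_table(FIRST_ROW, population_size=50, sample_size=5)
print(sample)  # [3, 47, 43, 36, 46]
```

Here 73, 86, 96 and 61 are rejected as exceeding 50, and the second occurrences of 47 and 36 are rejected as repeats.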
TABLE A2 THE NORMAL DISTRIBUTION

Probability of obtaining deviations (positive or negative) greater
than given multiples of the standard deviation

Deviation  Probability    Deviation  Probability    Deviation  Probability
   x/σ          P            x/σ          P            x/σ          P

   0.0       1.0000          1.0       0.3173          2.0       0.0455
   0.1       0.9203          1.1       0.2713          2.1       0.0357
   0.2       0.8415          1.2       0.2301          2.2       0.0278
   0.3       0.7642          1.3       0.1936          2.3       0.0214
   0.4       0.6892          1.4       0.1615          2.4       0.0164
   0.5       0.6171          1.5       0.1336          2.5       0.0124
   0.6       0.5485          1.6       0.1096          2.6       0.0093
   0.7       0.4839          1.7       0.0891          2.7       0.0069
   0.8       0.4237          1.8       0.0719          2.8       0.0051
   0.9       0.3681          1.9       0.0574          2.9       0.0037
   1.0       0.3173          2.0       0.0455          3.0       0.0027
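
The entries of Table A2 can be checked, or extended to other deviations, by direct computation. The following is a minimal sketch using only the standard Python library; the function name two_sided_tail is an illustrative choice. It relies on the identity that the probability of a normal deviation exceeding x times the standard deviation, in either direction, equals erfc(x/√2).

```python
import math


def two_sided_tail(deviation):
    """Probability that a normal variate deviates from its mean by more
    than `deviation` standard deviations, in either direction."""
    return math.erfc(abs(deviation) / math.sqrt(2.0))


# Recompute the table for deviations 0.0, 0.1, ..., 3.0 to four decimals.
table = [(k / 10.0, round(two_sided_tail(k / 10.0), 4)) for k in range(31)]

for deviation, probability in table:
    print(f"{deviation:3.1f}  {probability:.4f}")
```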
BIBLIOGRAPHY ON SAMPLING

The bibliography to the First Edition was drawn up by Mr. D. R. Read, 
and is reprinted here without change. It was based on a bibliography prepared 
by the Food and Agricultural Organisation of the United Nations. Additional 
references arranged as in the original bibliography will be found on pages 
14-433. The sections of the supplementary bibliographies are distinguished by 
dashes. 

The papers have been classified under the following heads : 

(A) Theory and methods. 

(B) Machine methods. 

(C) Population censuses. 

(D) Sociology, nutrition, health, etc. 

(E) Opinion surveys and market research. 

(F) Economics : surveys of industry, censuses of production, labour 
force, etc. 

(G) Agricultural economics and farm practice. 
(H) Crop estimation and forecasting, etc. 

(I) Forestry and land utilization surveys. 
(J) Estimation of wild populations. 

Since a single paper does not necessarily deal with only one subject, the 
subject classification must be taken as approximate only. A certain amount 
of general theory, for example, will be found in papers primarily dealing with 
special applications. In some instances where the original paper could not 
be consulted the classification has been made from the title and journal. Papers 
by the same author may be found under more than one heading, but papers 
by more than one author are indexed in the section concerned under the name 
of each author, so as to avoid difficulty in tracing all papers by a given author. 

BOOKS 

General 

BAEHNE, G. W. (1935). " Practical applications of the punched card method in

colleges and universities." New York : Columbia University Press. 
BLANKENSHIP, A. (1943). " How to conduct consumer and opinion research " 

(2nd. edn., 1945). New York : Harpers. 

CANTRIL, H. (1944). " Gauging public opinion." Princeton University Press. 
CHURCHMAN, C. W., ACKOFF, R. L., and WAX, M. (1947). " Measurement of consumer 

interest." Philadelphia : University of Philadelphia Press.
FISHER, R. A. (1925). " Statistical methods for research workers " (10th edn., 1946). 

Edinburgh : Oliver & Boyd. 
(1935). "The design of experiments" (4th edn., 1947). Edinburgh: 

Oliver & Boyd. 
HARTKEMEIER, H. P. (1942). "Principles of punch-card machine operation."

New York : Thomas Y. Crowell. 
KENDALL, M. G. (1943), and STUART, A. "The advanced theory of statistics," Vol. I,

1958. Vol. II, 1961. London: Griffin. 

PEATMAN, J. G. (1947). " Descriptive and sampling statistics. " New York: Harpers. 
RHODES, E. C. (1933). " Elementary statistical methods " (8th edn., 1948). London: 

Routledge. 
SCHUMACHER, F. X., and CHAPMAN, R. A. (1942). " Sampling methods in forestry

" Méthodes statistiques modernes des Administrations Fédérales

YULE, G. U., and KENDALL, M. G. "An introduction to the theory of
statistics" (14th edn., 1950, 1958). London: Griffin.

Reports on Surveys, etc. 

BOWLEY, A. L. (1930-1935). " New survey of London life and labour." Vols. I-IX.
JONES, D. CARADOG (1934). " The social survey of Merseyside." London: Hodder &
(1901). " Poverty : a study of town life " (4th edn., 1908).
work of the University of Bristol Social Survey." Bristol : Arrowsmith.
Printing Office, Washington, D.C.

Tables 
and medical research" (3rd edn., 1948). Edinburgh: Oliver & Boyd.

PAPERS 

A. THEORY AND METHODS 
sections are not separately indexed.



Abstraction of data, 109, 123. 

Accuracy, 32, 143, 145, 247. 

Address, 370; modification of, 374. 

Addressograph list, 82. 

Administrative areas, sampling by, 36, 75. 

Administrative organization for surveys, 
102. 

Advertising, 79. 

Advisory Economists, 59. 

Aerial photographs, 35, 42, 74, 86. 

Agricultural Meteorological Committee, 
13. 

Agriculture, frames for, 81-87. 

Allied Mission for Observing Greek Elec- 
tion, 71. 

Alternative estimates, 145. 

Analysis of results, 108-141; methods,
109; by Cope-Chat cards, 110; by
punched cards, 112-123; sampling for,
128, 364; critical, 109, 131, 308-332, 
394; of two-way tables, 131-141, 394; 
errors in, 124, 397; speed of, 396. 

Analysis of variance, 205; interpretation 
of, 266; applications of, 218, 250, 254, 
269, 280, 281, 350, 361; reporting, 143; 
examples of, 208, 210, 226, 228, 252, 273. 

Animal populations, 44. 

Area sampling, 68-79, 82-87; see also 
systematic sample from areas. 

Areas, measurement by sampling, see 
point and line sampling. 

Areas, selection with equal probability, 82. 

Attenuation, 322. 

Attributes, see qualitative variates. 

Autocode, 376. 

Automatic recording, 355. 



Balanced differences, 231. 

Balanced sample, 39, 174, 221. 

Bias, 9-17, 143; permissible, 17; 
estimation of, 239, 345; in selection, 
9-15, 65, 80, 84, 165, 222, 240, 367; in 
demarcation of units, 15, 164; in eye 
estimates, 43, 88, 163, 165, 222; in 
estimation, 16, 73, 77, 145, 162, 174; in 
estimate of error, 198; relative precision 
of biased and unbiased estimates, 344. 



Bibliography, 401-433. 

Binary scale, 371. 

Biological sampling, 48, 236. 

Bit, 371. 

Blocks, city, 68, 71. 

Blocks, randomized, 105. 

Blythe, R. H., 71. 

B modification, 374. 

Box, K., 55. 

Boyd, D. A., 81, 322. 

British Institute of Public Opinion, 299. 

British Tabulating Machine Co., see 

punched cards. 

Bureau of Agricultural Economics, 73. 
Bureau of the Census, 73. 



Calcutta Institute of Statistics, 83. 

Calibration of eye estimates, 43, 88, 165, 
222. 

Canada, population census, 308; labour 
force survey, 333, 343, 358. 

Cancer, knowledge of, 315. 

Cards, 104, 109; see also Cope-Chat cards 
and punched cards. 

Causal relationship, 131, 188. 

Cells, 41. 

Census of Woodlands, 16, 44, 46, 83, 99, 
163, 220, 232, 238, 242, 257, 288. 

Central Office of Information, 78. 

Change, see successive occasions. 

Checks on field work, 106; on computa- 
tions, 124; by comparison with com- 
plete returns, 31, 144. 

Chi-squared test, 22, 28, 200.

Cluster sampling, 20. 

Cochran, W. G., 270, 273. 

Coding, 109, 110, 111, 118-123, 126, 391. 

Coefficient of variation, 184. 

Collation, 370.

Collator, 118. 

Collective characteristics, 122. 

Commercial undertakings, 79. 

Comparability, 50. 

Complete census, 3; combination with 
sample, 47. 

Composite sampling scheme, 46. 

Compulsory returns, 59. 
Computations, preliminary, 108, 117, 123;

checks on, 124; see also analysis. 
Constants, fitting of, 137, 201. 
Consumer preferences, 79. 
Control, machine, 114. 
Control characters, 40. 
Cope-Chat cards, 104, 109, 110. 
Corlett, T., 349. 
Correlation coefficient, 176, 199. 
Costs, 143; minimization of, 283-296, 

339, 352. 
Counter, 113. 
Counts, in analysis, 109, 110, 111, 113, 115, 

376; in electronic computation, 375. 
Covariance, 198. 
Coverage, 49, 141. 
Cox, G., 138. 
Crockery breakages, 55. 
Crop estimation, 42, 87-92; areas, 168, 

224, 272, 286, 289, 290; yields per acre, 

165, 168, 222, 224, 264, 273, 286, 289, 

291, 322; sampling of standing crop, 13,

15, 90, 99, 322. 
Crop forecasting, 92. 
Cross footing multiplying punch, 117. 
Crowds, 43. 
Cruising, 42, 90. 



Dalenius, T., 386. 

Deductions from surveys, qualifications of, 

131, 212. 

Deep stratification, 358. 
Defective sample, adjustment for, 129. 
Degrees of freedom, 185, 206. 
Deming, W. E., 65, 71, 127. 
Description of survey, 141. 
Distributors, 114. 
Doering, C. R., 315. 
Domains of study, 24, 28; estimates for, 

146, 297; errors for, 146, 202, 210, 298- 

305. 

Double-ratio estimate, 343. 
Drum, magnetic, 372. 
Duplicate samples, 242. 
Duplication in frame, 60. 
Durant, H., 299.
Dwellings, 67, 74. 
Dyke, G. V., 315, 322. 



Eckler, A. R., 77. 

Economic institutions, frames for, 79. 

Editing of data, 392. 

Efficiency, 109, 144, 145, 200, 246-296, 

353; definition, 247; see also under 

types of sample. 
Efficient estimate, 247. 



Election forecasts, 80. 

Electoral lists, 66, 71. 

Electronic computers, 334, 370-398; 
general description, 370 ; types of store, 
371; input and output, 372; program- 
ming, 373; sub-routines, 375; auto- 
codes and interpretative routines, 376; 
uses for survey analysis, 376-398. 

Employment, 76. 

End corrections, 175. 

England and Wales, see Census of Wood- 
lands, National Farm Survey, Survey of 
Fertilizer Practice. 

Equipment, 142. 

Error graph, 235. 

Error, limits of, 191, 236. 

Error, sampling; see random sampling 
error and bias. 

Errors, in observation and measurement, 
15, 106, 354; in fieldwork, 106; in 
computations, 124; rounding off, 238; 
gross, 124, 354; grouping, 238; see also 
investigators, tests of. 

Estimates, alternative, 145. 

Estimation of population values, 145-182; 
rules for, 147; of sampling errors, 
183-245; of size of sample and relative
efficiency, 94-99, 246-296; see also 
under types of sample. 

Experiments, 105, 131, 212. 

Explanatory notes (census forms), 103. 

Exploratory surveys, 48, 99. 

Eye estimates, calibration of, 43, 88, 165,
222. 



Factorial design, 105. 

Factories, 79. 

Factors, 131, 308. 

Families, 121, 217. 

Family Census (U.K.), 53, 64, 130. 

Family income, 327 ; Norfolk-Portsmouth, 
96, 186, 239. 

Family size, 308. 

Farms, 74, 81, 273, 288, see also Hertford- 
shire farms. 

Fertilizer Practice, see Survey of. 

Fiducial probability, 191, 236. 

Field, of punched card, 113. 

Field work, organization of, 102-106;
control of accuracy, 106. 

Fields, 81, 273, 288, 344, 347. 

Finite sampling, corrections for, 187, 246. 

Finney, D. J., 237, 314. 

Fisher, R. A., 105, 201, 267. 

Fitting constants, 137, 201. 

Fixed sample, 45. 

Flats, 68. 

Floating point arithmetic, 371. 
Flow diagram, 375.

Follow up, see non-response.

Food offices, 64. 

Forecasts, 80, 92. 

Forestry, frames for, 83-87, see also 
Census of Woodlands. 

Forms, 57, 103-105. 

Frame, 20, 60-87, 144; defects of, 60, 
144; human populations, 62-78; 
economic institutions, 79; agriculture, 
81-87; forestry, 83-87; construction 
of second-stage, 34, 68, 71; from 
censuses, 65; from lists, 63, 66, 70, 
79; from maps, 68-75, 79, 81-86; 
from aerial photographs, 86. 

Frankel, L. R., 77. 

Functions, errors of, 196. 



Galvani, L., 40. 

Gamma function, 293. 

Gang-punching, 116. 

Geoffrey, L., 127. 

Geographical scope, 49, 141. 

Gini, C., 40. 

Glass, D. V., 130. 

Gray, P. S., 349. 

Greece, population census, 71. 

Gregory, W., 299. 

Gross errors, 124, 354. 

Group information, 383. 

Grouping, 118, 186, 188, 238, 379. 



Half-open interval, 67, 68. 

Hansen, M. H., 36, 65, 352, 358.

Haphazard selections, 10. 

Healy, M. J. R., 355. 

Hertfordshire farms, samples for wheat 
acreage, 30, 36; random sample, 97, 
152, 159, 161, 162, 189, 205, 214, 216, 
220, 252, 258; stratified sample, 150, 
203, 249, 251; variable sampling 
fraction, 154, 207, 252, 255, 256, 340, 
351; probability proportional to size, 
351; samples of parishes, 169, 226, 264, 
266; two-stage samples, 270, 271, 290. 

Hollerith, see punched cards. 

Households, 78, 121, 217, 365, 383. 

Houses, percentages defective, 149, 195. 

Housing conditions, relation to health and 
income, 327. 

Hurwitz, W. N., 36, 352, 358. 



I.B.M., see punched cards. 
Inaccuracy, in frame, 60. 
Inadequacy, in frame, 60, 61, 78. 



Incomplete census, 2. 

Incomplete results, 10; adjustment for, 

129, 337, 381, 382, 388. 
Incompleteness, in frame, 60. 
Independence, 196. 
Independent samples, 45. 
Index numbers, calculation of, 117. 
India, Calcutta Institute of Statistics, 83. 
Industrial undertakings, 48, 59, 79.
Information, description of, 141 ; required, 

51; methods of collection, 57; 

practicability, 55. 
Insect populations, 44. 
Instructions, 103, 141. 
Integral values of supplementary variate, 

217. 

Interactions, 105, 140, 211, 311. 
Interpenetrating samples, 44, 105, 107, 

143, 241, 242; examples, 83.
Interpretative routine, 376. 
Inter-relations between units, 54. 
Intra-class correlation, 267. 
Investigators, 58, 105; tests of, 44, 99, 

105, 107, 143, 241 ; instructions to, 103; 

conditions of work, 106. 
Iowa State College Statistical Laboratory, 

73. 
Italy, population census, 40. 



Jamaica, agricultural survey, 356. 
Jessen, R. J., 71, 73. 
Jump order, 372. 



Kempthorne, O., 71, 117.
Keyfitz, N., 308, 333, 343, 358.
King, A. J., 73.
King, G. W., 355.
Kirby, J., 327. 
Kiser, C. V., 11. 
Kish, L., 358. 
Kraals, 159, 214. 



Land, utilization surveys, 86. 

Latin square, 350. 

Lattice sampling, 356, 363. 

Least squares, 137, 201. 

Limits of error, 191, 236. 

Line sampling, 42, 85, 86; errors, 229;
examples, 232.

Linear functions, standard errors of, 196.
Listing, 114. 
Lists, see frame and systematic sample.

Livestock, 81. 

Localized population survey, 75. 

Logical order, 370. 

Logits, 314, 395. 

Lombard, H. L., 315. 

London School of Economics, 308, 343. 

Loop, 375. 

Losses due to errors, 292, 339. 



M'Gonigle, G. C. M., 327. 

Madow, W., 358. 

Magnetic core store, 371. 

Magnetic drum, 372. 

Magnetic tape, 372. 

Mahalanobis, P. C., 83, 92. 

Maps, use as frames, 68-75, 79, 81-86; 

areas from, see point and line sampling. 
Marginal categories, 43. 
Marginal totals, 376. 
Mark sensing, 109, 333. 
Market research, 79. 
Master cards, 116. 
Master sample, 65, 73, 75. 
Mathison, I., 81.
Matrix inversion, 321. 
Mean, arithmetic, 145; rule for estimation 

of, 148; geometric, 145; working, 185; 

correction for, 185. 
Mean square, 206. 
Mean square deviation, 183. 
Measurement, errors in, 15. 
Mechanical editing, 333. 
Median, 145. 
Milk, composition of, 178, 181, 234, 235, 

262. 

Milk fever, 395. 
Ministry of Agriculture, 82, 128, 158, 322; 

crop reporters, 324. 
Ministry of Home Security, 67. 
Morbidity, 11. 
Moving observer, 43. 
Multi-phase sample, 38; estimates, 157, 

159, 162; errors, 213, 219, 258; size

and efficiency, 258, 286, 335; on 

electronic computer, 386. 
Multi-stage sample, 18, 34; estimates, 

170, 171; errors, 226; size and

efficiency, 98, 268, 285; in lattice 

sampling, 362 ; on electronic computer, 

385; examples, 71, 77, 81, 84. 
Multi-stage sample with uniform overall 

sampling fraction, 36, 148, 171, 278, 

287, 349, 350; examples, 71, 77; 

with adjustment of proportions of 

second-stage units, 78. 
Multiple classification, analysis of, 131-141, 

308-317, 394. 



Multiple punching, 119. 
Multiple stratification, 25, 254; examples, 
73, 77 ; without control of sub-strata, 25. 
Multiplying punch, 117. 



National Agricultural Advisory Service,

59, 324. 
National Farm Survey, 115, 117, 128, 158, 

301, 305, 306, 307, 353, 390. 
National Register, 64. 
Natural units, 20; hierarchy of, 121. 
Non-response, 59, 107, 130; sub-sample 

for, 108. 

Norfolk-Portsmouth, Virginia, 186. 
Normal distribution, 190; sample from, 

149, 185, 188, 190, 192, 298. 
Normal equations, 320.
Normal law of error, 190. 
Notation, 7, 146. 
Nuffield Trust, 78. 



Observation, errors of, 15. 
Observers, see investigators. 
Odds, 314. 

Opinion surveys, 79; effect of stratifica- 
tion, 248. 

Optimal allocation, 18, 28, 285, 338. 
Optimal programming, 372. 
Optimal values, 284. 
Order, 370. 
Order code, 371. 
Ordnance Survey, 75, 82, 83, 84. 
Orthogonal polynomials, 321. 
Orthogonality, 211.
Out-of-date frame, 60. 
Overall estimates, 175. 



Partial replacement, see successive 

occasions. 

Patterson, H. D., 179, 180, 315, 358. 
Percentage standard deviation, 96, 184. 
Percentage standard error, 95, 184. 
Percentages, choice of, 108; calculation 

of, 117; estimation of, 148; standard 

error of, 94, 193. 
Personnel, 142. 
Phase, see multi-phase. 
Pilot surveys, 48, 99, 273. 
Planning of surveys, 48-101, 246, 294. 
Point sampling, 35, 69, 82, 86; estimates, 

167; errors, 224; size and efficiency, 

262, 286; examples, 272. 
Pooled estimate of error, 205, 236. 
Pooling of classes, 137. 
Population, human, 121, 217; frames for,

62-78; localized surveys of, 75-78; 

special classes, 78. 
Population, statistical, 20; finite, 17; 

to be covered, 49, 141; values, see 

estimation. 
Population census, Greece, 71; Italy, 40;

Southern Rhodesia, 159, 214; U.K., 53; 

U.S.A., 65; frame from, 65. 
Postal enquiry, 58, 107, 130. 
Potato survey, 131-141, 199, 209, 322, 376. 
Powers-Samas, see punched cards. 
Precision, relative, 246-283, 305, 353; 

definition, 247. 
Precoding, 120.
Preliminary computations, see computations.

Preliminary count (of dwellings), 68. 

Preliminary estimates, 85, 92. 

Printing, 113. 

Probability of selection proportional to 
size, sample with, 35, 36; estimates, 
167, 169; errors, 224, 225; size and 
efficiency, 262; selection of, 35, 347; 
examples, 36, 71, 77; in multi-stage 
sampling, see multi-stage sample with 
uniform overall sampling fraction; see 
also point sampling and variable 
probability. 

Probits, 314. 

Programming, 373; optimal, 372. 

Progressive digiting, 115. 

Progressive totals, 115. 

Proportions, 303; see also percentages. 

Public opinion polls, 79, 299. 

Punched cards, 109, 112-123, 126, 333, 
355, 372. 

Punching, 109, 112, 120, 126. 

Purpose of survey, 49, 51, 141. 

Purposive selection, 40, 80, 142. 



Qualitative variates, 94, 193, 232, 248, 

314, 379, 394; rule for, 148. 
Quality control, 48. 
Quenouille, M. H., 319, 355. 
Questionnaires, 52-59, 103-105; tests of, 

99, 104, 105; postal, 58, 107. 
Questions, wording of, 103. 
Quota method, 80, 142, 299. 



Rabbit damage, 344. 

Raising factor, 147; overall, 170. 

Random numbers, 21, 297. 

Random sample, 10, 21; estimates, 145, 
148, 152, 159, 162, 297; errors, 183-196, 
212, 217, 218, 297; size and efficiency, 
94, 248, 249, 256, 353; examples, 31, 83. 



Random sampling error, 2, 9, 17; 
estimation of, 183-245, 387; by 
sampling, 238; from duplicate samples, 
242 ; presentation of, 243 ; see also under 
types of sample. 

Random selection, 21 ; examples, 22. 

Random selection from areas, 22. 

Randomized blocks, 105. 

Rating offices, 66. 

Ration books, 64. 

Ratios, rule for estimation of, 148; 
standard error of, 198, 212; calculation 
of, 117, 385; use in investigational 
work, 317; use in detection of gross 
errors, 354; see also supplementary 
information. 

Read, D. R., 373. 

Rees, D. H., 386. 

Regression, 155, 199, 219, 313, 385; 
multiple, 320, 327; curvilinear, 321; 
grouped data, 319; effect of random 
errors, 322 ; use in investigational work, 
317-332; use in detection of gross errors,
354, 392; see also supplementary 
information and calibration of eye 
estimates. 

Rejection of observations, 354. 

Rents, 158. 

Repeated surveys, 17, 79; see also 
successive occasions. 

Reports, 141. 

Representative sample, 9, 84. 

Reproducing punch, 116. 

Response, failure of, see non-response. 

Road network, sampling of, 363. 

Road transport, 334, 362. 

Robinson, H. L., 333, 358. 

Rolling total tabulator, 114. 

Rotational sampling schemes, 36. 

Rounding off, 118, 238. 

Routine, 373; sub-, 375; interpretative, 
376. 

Rowntree, B. Seebohm, 4.

Royal Commission on Population, 64, 130. 

Sample, types of, 20-47; see also under 
separate types. 

Sample census, 2. 

Sample survey, 4. 

Sampling error, see random sampling error 
and bias. 

Sampling fraction, 18, 23, 24, 147, 148. 

Sampling process, 1; in censuses and 
surveys, 2-6; census, incomplete 
sample, 2; survey, sample, 4. 

Sampling units, 20; choice of, 19; multi- 
stage, 34; inter-relations between, 54; 
variation in size of, 19, 98, 279; rule 
for estimation of number in population, 
147. 
Scoring, 147, 148, 195.

Selection, methods of, 10, 21, 29, 142, 334. 

Selection with probability proportional to 
size, see probability of selection. 

Shaul, J. R. H., 160. 

Sheppard's correction, 239. 

Shifting, 370. 

Shops, 79. 

Sickness, 12. 

Simpson, H. R., 423. 

Size, probability proportional to, see 
probability of selection. 

Size of sample, determination of, 94 111, 
246-296; see also under types of sample. 

Size of strata, variation in, 98, 280. 

Size of unit, effect on sampling error, 19, 
98; variation in, 99, 279. 

Snedecor, G., 105, 138. 

Social survey, 78. 

Soil analysis, 82. 

Soil temperatures, 282. 

Solids-not-fat, see milk. 

Sorter, 113. 

Sorter-counter, 113. 

Sorting, 110, 391. 

Southern Rhodesia, 159, 214. 

Spence, J. C., 327. 

Stage, see multi-stage. 

Standard deviation, 96, 183, 190; per- 
centage, 96, 184. 

Standard error, 32, 94, 183; percentage, 

95, 184; of qualitative variates, 94, 
193; of mean, 96, 184, 187; of total, 

96, 184, 187; of ratio, 198, 212; of 
multiple, 196; of product, 198; of 
sum, 197; of difference, 196; of linear 
function, 196; of weighted mean, 197; 
of standard deviation, 192; effect of lack
of independence, 198; see also under 
types of sample. 

Standardization, 157, 158, 159, 162, 213, 

219, 318. 

Statistical analysis, see analysis. 
Statistician, functions of, 6, 49. 
Stephan, F. F., 65. 
Stevens, W. L., 138. 
Stock, J. S., 77. 
Stones, sampling of, 12. 
Store, 370, 371. 
Stratification after selection, 25, 32, 152, 

205. 
Stratification, multiple, see multiple 

stratification. 
Stratified sample, variation in size of strata, 

98, 280. 
Stratified sample with one unit per 

stratum, 24, 78, 280. 
Stratified sample with uniform sampling 

fraction, 17, 23, 146; estimates, 150, 

160, 164; errors, 201, 205, 215, 221, 

300, 303, 387; size and efficiency, 98, 



248, 249, 256, 305, 353; selection of, 
334; incomplete results, 337; examples,
31, 37.

Stratified sample with variable sampling 
fraction, see variable sampling fraction. 

Streets, sampling by, 67, 69. 

Sub-Commission on Statistical Sampling,
141.

Sub-routine, 375. 

Sub-sample for non-response, 60; on 
successive occasions, 45; see also 
multi-phase sample. 

Sub-totals, 115.

Substitution, 10, 108. 

Successive occasions, sampling on, 17, 45; 
estimates, 175, 179; errors, 233; size 
and efficiency, 260; on electronic 
computer, 386; example, 77. 

Sukhatme, P. V., 15, 255. 

Sum of squares, 185, 206; calculation of,
185. 

Summary punch, 116. 

Supervision, 19, 105. 

Supplementary information, 18, 32, 38, 
98, 145, 146; ratio method, 71, 155-162,
171-174, 198, 212-218, 256; double- 
ratio method, 343; regression method, 
155, 162-165, 171, 218-222, 256; 
effects of errors in, 213. 

Survey, definition of, 4. 

Survey of Fertilizer Practice, 57, 81, 111, 
123, 171, 227, 240, 257, 264, 291, 295, 
346, 383. 

Syracuse, U.S.A., 11. 

Systematic sample from areas, 10, 41; 
estimates, 174; errors, 229; size and 
efficiency, 282; examples, 83. 

Systematic sample from a list or card 
index, 10, 29; estimates, 174; errors, 
229, 366; selection of, 334; examples, 
64, 65, 67, 81. 



t distribution, 192. 

Tabulation, machine, 114, 376-385. 

Tabulator, 113. 

Tape, paper, 372; magnetic, 372. 

Telephone enquiries, 80.

Teleprinter, 372. 

Temperatures, soil, 282. 

Tepping, E. J., 127, 358. 

Terminology, 7. 

Test order, 370. 

Tests of questionnaires, 99, 104, 105; of 

investigators, 99, 105, 241; of 

significance, 188, 200. 
Thomas, G., 55. 

Timber, see Census of Woodlands. 
Totals, computation of, 109, 110, 111, 113, 

376; rule for estimation of, 147. 
Tracks, 70. 
Trailers, 116, 120, 385.

Training, 99, 105. 

Transformations, 236, 314, 394. 

Travelling, 19. 

Two-nine feature, 120. 

Two-way tables, analysis of, 131, 394. 



U.S.A. employment estimates, 76; master 
sample, 73; population census, 65; presi- 
dential election, 80. 

Undeveloped areas, 34, 42, 296; frames 
for, 70, 85-87. 

Unemployment, 76. 

Uniform sampling fraction, 23; overall, 
see multi-stage sample with uniform 
overall sampling fraction. 

Unit, natural, see natural units. 

Unit, sampling, see sampling units. 

United Kingdom, effects of air raids, 67; 
Family Census, 53, 64, 130; localized 
surveys, 76; Population Census, 53, 365. 

United Nations, 141; Food and Agri- 
culture Organization, 373; Economic 
Commission for Europe, 363.

Variable probability, sampling with, 352. 
Variable sampling fraction, sample with, 

18, 28; estimates, 153, 161, 164; 

errors, 201, 205, 216, 221, 300, 303;

size and efficiency, 98, 254, 256, 305, 



350; optimal allocation, 18, 28, 285, 
338; selection of, 334; on electronic 
computer, 382; examples, 31, 71, 81,
115, 128, 171. 

Variance, 183; unequal, 201, 207. 

Variate, 146. 

Variation, coefficient of, 184. 

Vehicles, sample of, 334. 

Villages, 70. 



Weighted mean, 17; of sub-class means, 
134; of differences of sub-class means, 
136; standard error of, 197. 

Weighting factors, 108, 123. 

Wheat, 13, 15, 166, 223, 273, 344, 347; 
see also Hertfordshire farms. 

Wireworms, 236. 

Word, 370. 

World statistics, 51. 



Yates, F., 12, 13, 81, 138, 211, 237, 268,
283.
Yield per acre, see crop estimation and
crop forecasting; bias in, 16.



z transformation, 314.
Zacopanay, I., 268.